Multi-sound area-based speech detection method and related apparatus, and storage medium

ABSTRACT

This application discloses a multi-sound area-based speech detection method and related apparatus, and a storage medium, which are applied to the field of artificial intelligence. The method includes: obtaining sound area information corresponding to each sound area in N sound areas; using each sound area as a target detection sound area, and generating a control signal corresponding to the target detection sound area according to sound area information corresponding to the target detection sound area; processing a speech input signal corresponding to the target detection sound area by using the control signal corresponding to the target detection sound area, to obtain a speech output signal corresponding to the target detection sound area; and generating a speech detection result of the target detection sound area according to the speech output signal corresponding to the target detection sound area. Speech signals in different directions are processed in parallel based on a plurality of sound areas, so that in a multi-sound source scenario, the speech signals in different directions may be retained or suppressed by a control signal, to separate and enhance speech of a target detection user in real time, thereby improving the accuracy of speech detection.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation application of PCT Patent Application No. PCT/CN2021/100472, entitled “VOICE DETECTION METHOD BASED ON MULTIPLE SOUND REGIONS, RELATED DEVICE, AND STORAGE MEDIUM”, filed on Jun. 17, 2021, which claims priority to Chinese Patent Application No. 202010732649.8, filed with the State Intellectual Property Office of the People's Republic of China on Jul. 27, 2020, and entitled “MULTI-REGISTER-BASED SPEECH DETECTION METHOD AND RELATED APPARATUS, AND STORAGE MEDIUM”, all of which are incorporated herein by reference in their entirety.

FIELD OF THE TECHNOLOGY

This application relates to the field of artificial intelligence, and in particular, to a speech detection technology.

BACKGROUND OF THE DISCLOSURE

With the wide application of far-field speech in people's daily life, performing processing such as voice activity detection (VAD), separation, enhancement, recognition, calling, and the like on each possible sound source in a multi-sound source (or multi-user) scenario has become a bottleneck restricting a plurality of types of intelligent speech products from improving voice interaction performance.

A mono pre-processing system based on a main speaker detection algorithm is designed in the conventional technical solution. The pre-processing system generally estimates the speaker with the strongest signal energy (that is, the signal energy reaching a microphone array) and the corresponding azimuth, by estimating the azimuth in combination with the signal strength or by estimating the azimuth in combination with a spatial spectrum, and determines that speaker and that azimuth as the main speaker and the main speaker's azimuth.

However, when a plurality of speakers exist in the environment, the main speaker may be farther away from the microphone array than an interfering speaker, so determining the main speaker only according to the signal strength may be flawed. Although the volume of the main speaker may be higher than that of the interfering speaker, the speech signal of the main speaker suffers a greater propagation loss in the space, and the signal strength reaching the microphone array may be lower, resulting in a poor effect in subsequent speech processing.

SUMMARY

Embodiments of this application provide a multi-sound area-based speech detection method and related apparatus, and a storage medium. According to an aspect of this application, a multi-sound area-based speech detection method is provided, performed by a computer device, the method including:

obtaining sound area information corresponding to each sound area in N sound areas, the sound area information including a sound area identifier, a sound pointing angle, and user information, the sound area identifier being used for identifying a sound area, the sound pointing angle being used for indicating a central angle of the sound area, the user information being used for indicating a user existence situation in the sound area, N being an integer greater than 1;

using each sound area as a target detection sound area, and generating a control signal corresponding to the target detection sound area according to the sound area information corresponding to the target detection sound area, the control signal being used for performing suppression or retention on a speech input signal corresponding to the target detection sound area;

processing the speech input signal corresponding to the target detection sound area by using the control signal corresponding to the target detection sound area, to obtain a speech output signal corresponding to the target detection sound area; and

generating a speech detection result of the target detection sound area according to the speech output signal corresponding to the target detection sound area.

According to another aspect of this application, a speech detection apparatus is provided, deployed on a computer device, the apparatus including:

an obtaining module, configured to obtain sound area information corresponding to each sound area in N sound areas, the sound area information including a sound area identifier, a sound pointing angle, and user information, the sound area identifier being used for identifying a sound area, the sound pointing angle being used for indicating a central angle of the sound area, the user information being used for indicating a user existence situation in the sound area, N being an integer greater than 1;

a generation module, configured to use each sound area as a target detection sound area, and generate a control signal corresponding to the target detection sound area according to the sound area information corresponding to the target detection sound area, the control signal being used for performing suppression or retention on a speech input signal corresponding to the target detection sound area;

a processing module, configured to process the speech input signal corresponding to the target detection sound area by using the control signal corresponding to the target detection sound area, to obtain a speech output signal corresponding to the target detection sound area; and

the generation module being further configured to generate a speech detection result of the target detection sound area according to the speech output signal corresponding to the target detection sound area.

According to another aspect of this application, a computer device is provided, including: a memory, a transceiver, a processor, and a bus system,

the memory being configured to store a program,

the processor being configured to execute the program in the memory, and perform the method in the foregoing aspects according to instructions in program code; and

the bus system being configured to connect the memory and the processor, to cause the memory and the processor to perform communication.

According to another aspect of this application, a non-transitory computer-readable storage medium is provided, the computer-readable storage medium storing instructions that, when executed by a processor of a computer, cause the computer to perform the method in the foregoing aspects.

According to another aspect of this application, a computer program product or a computer program is provided, the computer program product or the computer program including computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium, and executes the computer instructions, so that the computer device performs the method provided in the various implementations in the foregoing aspects.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic diagram of an environment of a scenario based on a multi-user conference according to an embodiment of this application.

FIG. 2 is a schematic diagram of an embodiment of a speech detection system according to an embodiment of this application.

FIG. 3 is a schematic diagram of an embodiment of a multi-sound area-based speech detection method according to an embodiment of this application.

FIG. 4 is a schematic diagram of a multi-sound area division manner according to an embodiment of this application.

FIG. 5 is a schematic architectural diagram of a multi-channel sound pickup system according to an embodiment of this application.

FIG. 6 is another schematic architectural diagram of a multi-channel sound pickup system according to an embodiment of this application.

FIG. 7 is a schematic diagram of an interface of implementing calling by using a multi-sound area-based speech detection method according to an embodiment of this application.

FIG. 8 is another schematic architectural diagram of a multi-channel sound pickup system according to an embodiment of this application.

FIG. 9 is a schematic diagram of an interface of implementing dialog responding by using a multi-sound area-based speech detection method according to an embodiment of this application.

FIG. 10 is another schematic architectural diagram of a multi-channel sound pickup system according to an embodiment of this application.

FIG. 11 is a schematic diagram of implementing text recording by using a multi-sound area-based speech detection method according to an embodiment of this application.

FIG. 12 is a schematic diagram of an embodiment of a speech detection apparatus according to an embodiment of this application.

FIG. 13 is a schematic structural diagram of a computer device according to an embodiment of this application.

DESCRIPTION OF EMBODIMENTS

Embodiments of this application provide a multi-sound area-based speech detection method and related apparatus, and a storage medium. In a multi-sound source scenario, speech signals in different directions may be retained or suppressed by a control signal, so that speech of each user can be separated and enhanced in real time, thereby improving the accuracy of speech detection and improving the effect of speech processing.

It is to be understood that the multi-sound area-based speech detection method provided in this application can perform speech recognition and semantic recognition when a plurality of users speak simultaneously, and then determine which user to respond to. A case in which a plurality of users speak is likely to occur in a far-field recognition scenario. For example, a plurality of users may speak simultaneously in a conference room, a car, or a room with a smart home device, and as a result, a multi-source signal may interfere with detection. The multi-sound area-based speech detection method provided in this application can resolve the problem of signal interference existing in the foregoing scenarios. For example, a case in which a plurality of users in the surrounding environment speak simultaneously often occurs in a wake-up-free scenario of a smart speaker product. In view of this, according to the method provided in this application, which user to respond to is first determined, recognition is then performed on the content and intention of that user's speech, and the smart speaker product determines, according to the recognition result, whether to respond to the user's voice command.

For ease of understanding, the speech detection method provided in this application is described below with reference to a specific scenario. Referring to FIG. 1, FIG. 1 is a schematic diagram of an environment of a scenario based on a multi-user conference according to an embodiment of this application. As shown in the figure, in a far-field conference scenario, there may be a plurality of participants in a conference room at the same time, for example, a user 1, a user 2, a user 3, a user 4, a user 5, and a user 6. A conference system may include a screen, a camera, and a microphone array. The microphone array is configured to collect speech of the six users, the camera is configured to capture real-time pictures of the six users, and the screen may display the pictures of the six users and display information related to the conference. For a call application, a main speaker in the conference scenario (there may usually be one or two main speakers) needs to be determined in real time, and speech of the main speaker is enhanced and transmitted to the remote end to which the call is connected. In addition, in the conference scenario, for a conference transcription function, whether each user is speaking needs to be determined in real time, so that speech of a speaker is separated and enhanced and is transmitted to an automatic speech recognition (ASR) service module in a cloud, and speech content is recognized by using the ASR service module.

The multi-sound area-based speech detection method provided in this application is applicable to a speech detection system shown in FIG. 2. Referring to FIG. 2, FIG. 2 is a schematic diagram of an embodiment of a speech detection system according to an embodiment of this application. As shown in the figure, the speech detection system includes a server and a terminal device. The server involved in this application may be an independent physical server, or may be a server cluster or a distributed system formed by a plurality of physical servers, or may be a cloud server that provides basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, a middleware service, a domain name service, a security service, a content delivery network (CDN), big data, and an AI platform. The terminal device may be a smart television, a smart speaker, a smartphone, a tablet computer, a notebook computer, a palmtop computer, a personal computer, or the like. This application is not limited thereto.

In the speech detection system, the terminal device may communicate with the server through a wireless network, a wired network, or a movable storage medium. The foregoing wireless network uses a standard communication technology and/or protocol. The wireless network is usually the Internet, but may alternatively be any other network, including but not limited to a Bluetooth connection, a local area network (LAN), a metropolitan area network (MAN), a wide area network (WAN), or any combination of a mobile network, a dedicated network, or a virtual dedicated network. In some embodiments, custom or dedicated data communication technologies may be used in place of or in addition to the foregoing data communication technologies. The movable storage medium may be a universal serial bus (USB) flash drive, a removable hard disk, or another movable storage medium. This is not limited in this application. Although FIG. 2 merely shows five types of terminal devices, it is to be understood that the examples in FIG. 2 are merely used to understand the technical solution, and are not to be construed as a limitation on this application.

Based on the speech detection system shown in FIG. 2, the microphone array equipped on the terminal device picks up a speech signal and other sounds in an environment and then transmits a collected digital signal to a pre-processing module for the speech signal, and the pre-processing module performs processing such as extraction, enhancement, VAD detection, speaker detection, and main speaker detection on target speech. The specific processing content is flexibly determined according to the scenario and functional requirements. A speech signal enhanced by the pre-processing module may be sent to the server, and a speech recognition module or a voice call module deployed in the server processes the enhanced speech signal.

The multi-sound area-based speech detection method provided in this application is described below with reference to the foregoing description. Referring to FIG. 3, an embodiment of the multi-sound area-based speech detection method provided in the embodiments of this application includes the following steps:

101: Obtain sound area information corresponding to each sound area in N sound areas, the sound area information including a sound area identifier, a sound pointing angle, and user information, the sound area identifier being used for identifying a sound area, the sound pointing angle being used for indicating a central angle of the sound area, the user information being used for indicating a user existence situation in the sound area, N being an integer greater than 1.

In this embodiment, a space within a visual range may first be divided into N sound areas. For ease of description, referring to FIG. 4, FIG. 4 is a schematic diagram of a multi-sound area division manner according to an embodiment of this application. As shown in the figure, assuming that a 360-degree space is divided into 12 sound areas, each sound area spans 30 degrees, and the central angle of each sound area is $\theta_i$ ($i = 1, \ldots, N$), for example, $\theta_1 = 15$ degrees, $\theta_2 = 45$ degrees, and $\theta_3 = 75$ degrees. The rest can be deduced by analogy. FIG. 4 is merely an example; in actual application, N is an integer greater than or equal to 2, such as 12, 24, 36, and the like. The division quantity differs according to the available amount of computation. In addition, the space within the visual range may alternatively be divided non-uniformly. This is not limited herein. Each sound area corresponds to one sound source; if two or more users are present in a specific sound area, such users may also be considered as a same person. Therefore, during actual sound area division, each sound area may be divided as finely as possible.
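
The central angles in such a uniform division follow directly from the area width. The following minimal sketch (an illustration only, not part of the application; the function name is hypothetical) computes $\theta_i$ for N uniformly divided sound areas and reproduces the 12-area example above:

```python
# Uniform sound-area division: N areas covering 360 degrees, with the
# central angle of the i-th area at (i - 0.5) * (360 / N) degrees.

def central_angles(n_areas):
    width = 360.0 / n_areas  # angular width of one sound area
    return [width * (i - 0.5) for i in range(1, n_areas + 1)]

print(central_angles(12)[:3])  # [15.0, 45.0, 75.0] as in FIG. 4
```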

A speech detection apparatus may obtain sound area information corresponding to each sound area after the sound area division is completed. The sound area information includes a sound area identifier, a sound pointing angle, and user information. For example, sound area information of the first sound area may be represented as $(1, \theta_1, \lambda_1)$, sound area information of the second sound area may be represented as $(2, \theta_2, \lambda_2)$, and the rest can be deduced by analogy. In the general form $\{(i, \theta_i, \lambda_i)\}_{i=1,\ldots,N}$, $i$ represents an i-th sound area, $\theta_i$ represents a sound pointing angle corresponding to the i-th sound area, and $\lambda_i$ represents user information corresponding to the i-th sound area. The user information is used for indicating a user existence situation in the sound area. For example, assuming that no user exists in the i-th sound area, $\lambda_i$ may be set to −1; and assuming that a user exists in the i-th sound area, $\lambda_i$ may be set to 1.
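
As an illustration only, the sound area information $(i, \theta_i, \lambda_i)$ may be held in a small record type; the field names below are hypothetical and assume the uniform 12-area division described above:

```python
from dataclasses import dataclass

@dataclass
class SoundAreaInfo:
    area_id: int           # i, the sound area identifier
    pointing_angle: float  # theta_i, central angle of the area in degrees
    user_flag: int         # lambda_i: 1 if a user exists, -1 otherwise

areas = [SoundAreaInfo(i, 30.0 * (i - 0.5), -1) for i in range(1, 13)]
areas[1].user_flag = 1  # e.g. a user was detected in the second sound area
```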

The method provided in the embodiments of this application may be performed by a computer device, and specifically may be performed by the speech detection apparatus deployed on the computer device. The computer device may be a terminal device, or may be a server; that is, the speech detection apparatus may be deployed on the terminal device, or may be deployed on the server. Certainly, the speech detection apparatus may also be deployed in the speech detection system, that is, the speech detection apparatus may implement the method provided in this application based on a multi-channel sound pickup system.

102: Use each sound area as a target detection sound area, and generate a control signal corresponding to the target detection sound area according to sound area information corresponding to the target detection sound area, the control signal being used for performing suppression or retention on a speech input signal, the control signals and the sound areas being in a one-to-one correspondence.

In this embodiment, after obtaining the sound area information corresponding to each sound area in the N sound areas, the speech detection apparatus may use each sound area as a target detection sound area, and generate a control signal corresponding to the target detection sound area according to the sound area information corresponding to the target detection sound area, where the control signal may suppress or retain a speech input signal obtained through the microphone array.

Assuming that it is detected that no user exists in the i-th sound area, it indicates that a speech input signal in the sound area belongs to noise (abnormal human voice). Therefore, a control signal generated for the sound area may perform suppression on the speech input signal. Assuming that it is detected that a user exists in the i-th sound area and a speech input signal in the sound area belongs to normal human voice, a control signal generated for the sound area may perform retention on the speech input signal.

Whether a user exists in a sound area may be detected by using a computer vision (CV) technology, or whether a user exists in a current sound area may be estimated by using a spatial spectrum.

103: Process a speech input signal corresponding to the target detection sound area by using the control signal corresponding to the target detection sound area, to obtain a speech output signal corresponding to the target detection sound area, the control signals, the speech input signals, and the speech output signals being in a one-to-one correspondence.

In this embodiment, after obtaining the control signal corresponding to each sound area in the N sound areas, the speech detection apparatus may still use each sound area as the target detection sound area, and process the speech input signal corresponding to the target detection sound area by using the control signal corresponding to the target detection sound area, to obtain the speech output signal corresponding to the target detection sound area. In other words, suppression or retention is performed on a speech input signal in a corresponding sound area by using the control signal corresponding to the sound area, thereby outputting a speech output signal corresponding to the sound area. For example, when no user exists in the i-th sound area, a control signal of the i-th sound area may be “0”, that is, suppression is performed on the speech input signal corresponding to the sound area. In another example, when a user who is normally speaking exists in the i-th sound area, the control signal corresponding to the i-th sound area may be “1”, that is, retention is performed on the speech input signal corresponding to the sound area. Further, processing such as extraction, separation, and enhancement may be performed on the speech input signal corresponding to the sound area.
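
In its simplest form described above, the control signal reduces to a per-area gain of “0” (suppress) or “1” (retain). The sketch below is an assumption for illustration (a real system would separate signals rather than merely gate them) showing such gains applied to per-area speech input signals:

```python
import numpy as np

def apply_control(inputs, controls):
    """inputs: (N, T) speech input signals, one row per sound area.
    controls: (N,) gains in {0, 1}. Returns (N, T) speech output signals."""
    return inputs * controls[:, None]

x = np.random.randn(12, 16000)  # 12 sound areas, 1 s of audio at 16 kHz
c = np.zeros(12)
c[1] = 1.0                      # retain only the second sound area
y = apply_control(x, c)         # all other areas are suppressed to zero
```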

104: Generate a speech detection result of the target detection sound area according to the speech output signal corresponding to the target detection sound area.

In this embodiment, to improve the quality of the speech output signal, the speech detection apparatus may further perform post-processing on the speech output signal corresponding to the sound area, that is, use each sound area as the target detection sound area, and generate the speech detection result corresponding to the target detection sound area according to the speech output signal corresponding to the target detection sound area. For example, processing such as cross-channel post-processing and noise reduction post-processing is performed on the speech output signal corresponding to the target detection sound area, and the post-processed speech output signal is detected, to finally generate a speech detection result corresponding to the sound area and further determine whether to respond to voice from the sound area. In some cases, the speech detection apparatus may detect whether each sound area meets a human voice matching condition. Assuming that the i-th sound area meets the human voice matching condition, a speech detection result corresponding to the i-th sound area may be that “a user exists in the i-th sound area”. Assuming instead that the i-th sound area does not meet the human voice matching condition, the speech detection result corresponding to the i-th sound area is that “no user exists in the i-th sound area”.

In this application, speech detection may be implemented based on the multi-channel sound pickup system. Referring to FIG. 5, FIG. 5 is a schematic architectural diagram of a multi-channel sound pickup system according to an embodiment of this application. As shown in the figure, the microphone array equipped on the terminal device may pick up an audio signal corresponding to each sound area, where the audio signal includes a speech input signal and a noise signal. A control signal corresponding to each sound area is generated by a signal separator, suppression or retention is performed on a speech input signal of a corresponding sound pointing angle by using the control signal corresponding to the sound area, and then cross-channel post-processing and noise reduction post-processing are performed on each speech output signal, to obtain a target speech output signal corresponding to the sound area. Finally, a speech detection result is determined based on the sound area information and the target speech output signal of each sound area, that is, a speech detection result of each sound area in the N sound areas is obtained.
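
The overall flow of FIG. 5 can be summarized as the following runnable sketch. Every stage here is a deliberately trivial stand-in (an assumption, not the application's actual algorithms): the control signal is a 0/1 gain derived from the user flag, separation is a plain multiplication, post-processing is a pass-through, and the human voice matching condition is an energy threshold:

```python
import numpy as np

def make_control(user_flag):
    return 1.0 if user_flag == 1 else 0.0   # suppress (0) or retain (1)

def separate(x, ctrl):
    return x * ctrl                          # stand-in for the signal separator

def postprocess(y):
    return y                                 # stand-in for cross-channel + denoise

def voice_match(y, thr=1e-3):
    return float(np.mean(y ** 2)) > thr      # stand-in matching condition

def detect_all(inputs, user_flags):
    """inputs: (N, T) per-area signals; user_flags: length-N list of +/-1."""
    results = {}
    for i, (x, lam) in enumerate(zip(inputs, user_flags), start=1):
        y = postprocess(separate(x, make_control(lam)))
        results[i] = "user exists" if voice_match(y) else "no user"
    return results

print(detect_all(np.random.randn(3, 16000), [1, -1, 1]))
```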

An embodiment of this application provides a multi-sound area-based speech detection method. Firstly, sound area information corresponding to each sound area in N sound areas is obtained, the sound area information including a sound area identifier, a sound pointing angle, and user information, so that each sound area may be used as a target detection sound area, and a control signal corresponding to the target detection sound area is generated according to sound area information corresponding to the target detection sound area; then, a speech input signal corresponding to the target detection sound area is processed by using the control signal corresponding to the target detection sound area, to obtain a speech output signal corresponding to the target detection sound area; and finally, a speech detection result of the target detection sound area is generated according to the speech output signal corresponding to the target detection sound area, so that the speech detection result corresponding to each sound area is obtained, thereby facilitating determining, according to the speech detection result, whether to respond to a user corresponding to the sound area. In the foregoing manner, speech signals in different directions are processed in parallel based on a plurality of sound areas, so that in a multi-sound source scenario, speech signals in different directions may be retained or suppressed by a control signal, and speech of each user can be separated and enhanced in real time, thereby improving the accuracy of speech detection and improving the effect of subsequent speech processing.

Based on the embodiment corresponding to FIG. 3, in an exemplary embodiment provided in the embodiments of this application, the obtaining sound area information corresponding to each sound area in N sound areas may include the following steps:

detecting each sound area in the N sound areas, to obtain a user detection result corresponding to the sound area;

using the sound area as the target detection sound area, and determining user information corresponding to the target detection sound area according to a user detection result corresponding to the target detection sound area;

determining lip motion information corresponding to the target detection sound area according to the user detection result corresponding to the target detection sound area;

obtaining a sound area identifier corresponding to the target detection sound area and a sound pointing angle corresponding to the target detection sound area; and

generating the sound area information corresponding to the target detection sound area according to the user information corresponding to the target detection sound area, the lip motion information corresponding to the target detection sound area, the sound area identifier corresponding to the target detection sound area, and the sound pointing angle corresponding to the target detection sound area.

A manner of obtaining sound area information based on the CV technology is described in this embodiment. Generally, a corresponding camera needs to be configured to capture a picture of the user. Coverage may be provided by one wide-angle camera, and a 360-degree space may be fully covered by two or three wide-angle cameras in a spliced manner. Each user in the space may be detected and numbered by using the CV technology, and related information may further be provided, for example, user identity information, a face azimuth, lip motion information, a facial orientation, a face distance, and the like. Each sound area in the N sound areas is detected, to obtain a user detection result corresponding to the sound area. The description is made in this application by using an example in which the user detection result includes user identity information and lip motion information, but this is not to be construed as a limitation on this application.

The user detection result includes the user information and the lip motion information. The user information includes: whether a user exists, and whether identity information of the user can be extracted when a user exists. For example, a user exists in the second sound area, and the user is recognized and determined as “Xiao Li” whose corresponding identity is “01011”. In another example, no user exists in the fifth sound area, and there is no need to perform recognition. The lip motion information indicates whether a lip of the user moves or not. Generally, the lips move when a person is speaking. Therefore, whether the user is speaking or not may further be determined based on the lip motion information. A sound area identifier corresponding to each sound area and a sound pointing angle corresponding to the sound area may be determined with reference to the pre-divided sound areas, thereby generating the sound area information $\{(i, \theta_i, \lambda_i, L_i)\}_{i=1,\ldots,N}$ corresponding to the sound areas. In the sound area information, $i$ represents an i-th sound area, $\theta_i$ represents a sound pointing angle of the i-th sound area, $\lambda_i$ represents user information of the i-th sound area, and $L_i$ represents lip motion information of the i-th sound area.

In addition, in this embodiment of this application, the manner of obtaining sound area information based on the CV technology is provided. In the foregoing manner, more sound area information may be detected by using the CV technology. It is equivalent to that a related situation of the user in each sound area may be “seen”, for example, whether a user exists, user information of the user, whether the user has lip motion, and the like, so that multi-modal information may be integrated and utilized, thereby further improving the accuracy of speech detection through information in a visual dimension, and providing a feasible manner for subsequent processing of related video solutions.

In some cases, based on the embodiment corresponding to FIG. 3, in another exemplary embodiment provided in the embodiments of this application, the determining user information corresponding to the target detection sound area according to a user detection result corresponding to the target detection sound area specifically includes the following steps:

determining a first identity as the user information when the user detection result corresponding to the target detection sound area is that a recognizable user exists in the target detection sound area;

determining a second identity as the user information when the user detection result corresponding to the target detection sound area is that no user exists in the target detection sound area;

determining a third identity as the user information when the user detection result corresponding to the target detection sound area is that an unknown user exists in the target detection sound area; and

the determining lip motion information corresponding to the target detection sound area according to the user detection result corresponding to the target detection sound area specifically includes the following steps:

determining a first motion identifier as the lip motion information when the user detection result corresponding to the target detection sound area is that a user with lip motion exists in the target detection sound area;

determining a second motion identifier as the lip motion information when the user detection result corresponding to the target detection sound area is that a user exists in the target detection sound area and the user does not have lip motion; and

determining a third motion identifier as the lip motion information when the user detection result corresponding to the target detection sound area is that no user exists in the target detection sound area.

A specific manner of extracting lip motion information and user information based on the CV technology is described in this embodiment. Since the user information and the lip motion information need to be determined according to an actual situation, user information and lip motion information in each sound area need to be detected, which is described in detail below.

First, a recognition manner for user information

For ease of description, any sound area in the N sound areas is used as an example to describe this application, and user information in other sound areas is determined in a similar manner, which is not described herein. Any sound area may be used as a target detection sound area. Assuming that the sound area is an i-th sound area, whether a user exists in the i-th sound area, and whether identity information of the user can be obtained when a user exists, may be determined based on a user detection result of the i-th sound area. User information corresponding to the i-th sound area is represented as $\lambda_i$, that is, the user information in a direction with a sound pointing angle $\theta_i$. When a user exists in the direction with the sound pointing angle $\theta_i$ and identity information of the user can be determined, it indicates that a name and identity of the user can be determined, and $\lambda_i$ is a first identity of the user, for example, “5”. When no user exists in the direction with the sound pointing angle $\theta_i$, $\lambda_i$ may be set to a special value, that is, a second identity, for example, “−1”. When a function of face recognition is not configured, that is, the identity information of the user cannot be determined, $\lambda_i$ may be set to another special value, that is, a third identity, for example, “0”, to inform a subsequent processing module that although there is a user in the direction, the identity is unknown; if necessary, identity information of the user may be further recognized through voiceprint recognition.

Second, a recognition manner for lip motion information

For ease of description, any sound area in the N sound areas is used as an example to describe this application, and lip motion information in other sound areas is determined in a similar manner, which is not described herein. Any sound area may be used as a target detection sound area. Assuming that the sound area is an i-th sound area, whether a user exists in the i-th sound area, and whether the user has lip motion when a user exists, may be determined based on a user detection result of the i-th sound area. The camera generally adopts an unmovable wide-angle camera, detects all people and faces within the visual range by using a CV algorithm, cuts a facial local image out, and detects, through the CV algorithm, whether an upper lip on a face is moving. Lip motion information corresponding to the i-th sound area is represented as $L_i$, that is, the lip motion information in a direction with a sound pointing angle $\theta_i$. When a user exists in the direction with the sound pointing angle $\theta_i$ and the user is determined to have lip motion, $L_i$ may be set to a first motion identifier, for example, “0”. When a user exists in the direction with the sound pointing angle $\theta_i$ but the user does not have lip motion, $L_i$ may be set to a second motion identifier, for example, “1”. When no user exists in the direction with the sound pointing angle $\theta_i$, $L_i$ may be set to a special value, that is, a third motion identifier, for example, “−1”.
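
The $\lambda_i$ and $L_i$ conventions above can be written out directly. The concrete values (“5”, “−1”, “0” for identity; “0”, “1”, “−1” for lip motion) follow the examples in the text, while the boolean inputs are assumed to come from the CV detection step:

```python
from typing import Optional

def user_info(user_present: bool, identity_id: Optional[int]) -> int:
    """lambda_i for one sound area."""
    if not user_present:
        return -1            # second identity: no user in the area
    if identity_id is None:
        return 0             # third identity: user present, identity unknown
    return identity_id       # first identity, e.g. 5 for a recognized user

def lip_motion_info(user_present: bool, lip_moving: bool) -> int:
    """L_i for one sound area."""
    if not user_present:
        return -1                  # third motion identifier: no user
    return 0 if lip_moving else 1  # first: lip motion; second: no lip motion
```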

In addition, an embodiment of this application provides a specific manner of extracting lip motion information and user information based on the CV technology. In the foregoing manner, the user information and the lip motion information of the user can be analyzed in a plurality of aspects, the feasibility of recognition may be improved as much as possible, and the information included in each sound area is analyzed in a plurality of dimensions, thereby improving the operability of the technical solution.

Based on the embodiment corresponding to FIG. 3, in another exemplary embodiment provided in the embodiments of this application, the generating a control signal corresponding to the target detection sound area according to sound area information corresponding to the target detection sound area specifically includes the following steps:

generating a first control signal when user information corresponding to the target detection sound area is used for indicating that no user exists in the target detection sound area, the first control signal belonging to the control signal, and the first control signal being used for performing suppression on the speech input signal; and

generating a second control signal when the user information corresponding to the target detection sound area is used for indicating that a user exists in the target detection sound area, the second control signal belonging to the control signal, and the second control signal being used for performing retention on the speech input signal.

A manner of generating a control signal without adopting the CV technology is described in this embodiment. When the CV technology is not adopted, a user identity cannot be recognized, and lip motion information of the user cannot be obtained. In this case, whether a user exists in a current sound area may be estimated by using a spatial spectrum, so that sound area information of the N sound areas is obtained, where the sound area information of the N sound areas may be represented as $\{(i, \theta_i, \lambda_i)\}_{i=1,\ldots,N}$.

For ease of description, any sound area in the N sound areas is used as an example to describe this application, and a control signal in another sound area may be generated in a similar manner, which is not described herein. Any sound area may be used as a target detection sound area. Assuming that the sound area is an i-th sound area, the sound area information of the i-th sound area is $(i, \theta_i, \lambda_i)$, where the user information $\lambda_i$ may indicate that no user exists in a direction with a sound pointing angle $\theta_i$, or that a user exists in the direction with the sound pointing angle $\theta_i$; if necessary, identity information of the user may further be recognized through voiceprint recognition, which is not described in detail herein. During the generation of the control signal, if it is detected that no user exists in the i-th sound area, all signals at the sound pointing angle $\theta_i$ may be learned and suppressed through a signal separator, that is, a first control signal is generated through the signal separator, and all the signals at the sound pointing angle $\theta_i$ are suppressed by using the first control signal. If it is detected that a user exists in the i-th sound area, signals at the sound pointing angle $\theta_i$ may be learned and retained through the signal separator, that is, a second control signal is generated through the signal separator, and the signals at the sound pointing angle $\theta_i$ are retained by using the second control signal.
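
Without CV, the decision therefore depends only on $\lambda_i$. A minimal sketch follows, with 0 and 1 standing for the first (suppression) and second (retention) control signals, as assumed in the earlier examples:

```python
def control_without_cv(user_flag: int) -> int:
    # lambda_i == -1 means no user: generate the first (suppression)
    # control signal; otherwise generate the second (retention) one.
    return 0 if user_flag == -1 else 1
```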

In addition, in this embodiment of this application, the manner of generating a control signal without adopting the CV technology is provided. In the foregoing manner, the control signal can be generated only by using audio data. In this way, on the one hand, the flexibility of the technical solution is improved; and on the other hand, the control signal may also be generated based on less information, thereby saving operation resources, improving the efficiency of generating the control signal, and saving power for the device.

Based on the embodiment corresponding to FIG. 3, in another exemplary embodiment provided in the embodiments of this application, the generating a control signal corresponding to the target detection sound area according to sound area information corresponding to the target detection sound area specifically includes the following steps:

generating a first control signal when user information corresponding to the target detection sound area is used for indicating that no user exists in the target detection sound area, the first control signal belonging to the control signal, and the first control signal being used for performing suppression on the speech input signal;

generating the first control signal when the user information corresponding to the target detection sound area is used for indicating that a user exists in the target detection sound area and the user does not have lip motion;

generating a second control signal when the user information corresponding to the target detection sound area is used for indicating that a user exists in the target detection sound area and the user has lip motion, the second control signal belonging to the control signal, and the second control signal being used for performing retention on the speech input signal; and

generating the first control signal or the second control signal according to an original audio signal when the user information corresponding to the target detection sound area is used for indicating that a user exists in the target detection sound area and a lip motion situation of the user is unknown.

A manner of generating a control signal adopting the CV technology is described in this embodiment. When the CV technology is adopted, a user identity may be recognized, and lip motion information of the user is obtained. In this case, whether a user exists in a current sound area may be estimated only by using the CV technology, or whether a user exists in a current sound area may be determined by using the CV technology in combination with a spatial spectrum estimation manner, so that sound area information of the N sound areas is obtained, where the sound area information of the N sound areas may be represented as $\{(i, \theta_i, \lambda_i, L_i)\}_{i=1,\ldots,N}$.

For ease of description, any sound area in the N sound areas is used as an example to describe this application, and a control signal in another sound area may be generated in a similar manner, which is not described herein. Any sound area may be used as a target detection sound area. Assuming that the sound area is an i-th sound area, the sound area information of the i-th sound area is $(i, \theta_i, \lambda_i, L_i)$, where the user information $\lambda_i$ may be a first identity, a second identity, or a third identity, and the lip motion information may be a first motion identifier, a second motion identifier, or a third motion identifier. Specifically, during the generation of the control signal, if it is detected that no user exists in the i-th sound area, all signals at the sound pointing angle $\theta_i$ may be learned and suppressed through a signal separator, that is, a first control signal is generated through the signal separator, and all the signals at the sound pointing angle $\theta_i$ are suppressed by using the first control signal. If it is detected that a user exists in the i-th sound area, whether the user has lip motion needs to be further determined.

If it is detected that a user exists in the i-th sound area but the user does not have lip motion, all signals at the sound pointing angle $\theta_i$ may be learned and suppressed through the signal separator, that is, a first control signal is generated through the signal separator, and all the signals at the sound pointing angle $\theta_i$ are suppressed by using the first control signal.

If it is detected that a user exists in the i-th sound area and the user has lip motion, signals at the sound pointing angle $\theta_i$ may be learned and retained through the signal separator, that is, a second control signal is generated through the signal separator, and the signals at the sound pointing angle $\theta_i$ are retained by using the second control signal.

If it is detected that a user exists in the i-th sound area but the lips cannot be clearly captured by the camera due to an unclear face or a relatively large face deflection angle, a lip motion situation of the user cannot be determined. In view of this, spatial spectrum estimation or azimuth estimation needs to be performed on an original audio signal inputted at the sound pointing angle $\theta_i$, to roughly determine whether the user is speaking. If it is determined that the user is speaking, signals at the sound pointing angle $\theta_i$ may be learned and retained through the signal separator, that is, a second control signal is generated through the signal separator, and the signals at the sound pointing angle $\theta_i$ are retained by using the second control signal. If it is determined that the user is not speaking, all signals at the sound pointing angle $\theta_i$ may be learned and suppressed through the signal separator, that is, a first control signal is generated through the signal separator, and all the signals at the sound pointing angle $\theta_i$ are suppressed by using the first control signal.
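
The CV-assisted decision logic above amounts to a small decision table. In the sketch below, `speaking_by_spatial_spectrum` stands in for the audio-based fallback estimate (an assumption; the application does not fix its implementation), `None` is a hypothetical encoding for the "lip motion unknown" case, and 0/1 again denote the first and second control signals:

```python
from typing import Optional

def control_with_cv(user_flag: int, lip_flag: Optional[int],
                    speaking_by_spatial_spectrum: bool) -> int:
    """lip_flag uses the L_i convention (0: lip motion, 1: no lip motion);
    None marks the 'lip motion situation unknown' case discussed above."""
    if user_flag == -1:      # no user in the area: suppress
        return 0
    if lip_flag == 0:        # user with lip motion: retain
        return 1
    if lip_flag == 1:        # user present but no lip motion: suppress
        return 0
    # Lip motion unknown (e.g. unclear face or large face deflection angle):
    # fall back to spatial spectrum / azimuth estimation on the original audio.
    return 1 if speaking_by_spatial_spectrum else 0
```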

In addition, in this embodiment of this application, the manner of generating a control signal adopting the CV technology is provided. In the foregoing manner, the control signal is generated according to both audio data and image data. In this way, on the one hand, the flexibility of the technical solution is improved; and on the other hand, a control signal generated based on more information may be more accurate, thereby improving the accuracy of speech detection.

Based on the embodiment corresponding to FIG. 3, in another exemplary embodiment provided in the embodiments of this application, the generating a control signal corresponding to the target detection sound area according to sound area information corresponding to the target detection sound area specifically includes the following steps:

generating the control signal corresponding to the target detection sound area according to sound area information corresponding to the target detection sound area by using a preset algorithm, the preset algorithm being an adaptive beamforming algorithm, a blind source separation algorithm, or a deep learning-based speech separation algorithm; and

the processing a speech input signal corresponding to the target detection sound area by using the control signal corresponding to the target detection sound area, to obtain a speech output signal corresponding to the target detection sound area specifically includes the following steps:

processing, when the preset algorithm is the adaptive beamforming algorithm, a speech input signal corresponding to the target detection sound area according to the control signal corresponding to the target detection sound area by using the adaptive beamforming algorithm, to obtain the speech output signal corresponding to the target detection sound area;

processing, when the preset algorithm is the blind source separation algorithm, the speech input signal corresponding to the target detection sound area according to the control signal corresponding to the target detection sound area by using the blind source separation algorithm, to obtain the speech output signal corresponding to the target detection sound area; and

processing, when the preset algorithm is the deep learning-based speech separation algorithm, the speech input signal corresponding to the target detection sound area according to the control signal corresponding to the target detection sound area by using the deep learning-based speech separation algorithm, to obtain the speech output signal corresponding to the target detection sound area.

A manner of signal separation based on the control signal is described in this embodiment. The preset algorithm adopted during the generation of the control signal is consistent with the algorithm adopted during signal separation in actual application. This application provides three preset algorithms, namely, an adaptive beamforming algorithm, a blind source separation algorithm, and a deep learning-based speech separation algorithm. The signal separation is described below with reference to the three preset algorithms.

1. The Adaptive Beamforming Algorithm

Adaptive beamforming is also referred to as adaptive spatial filtering. Spatial filtering processing may be performed by weighting each array element, to enhance useful signals and suppress interference. In addition, the weighting factor of each array element may be changed according to changes in the signal environment. Under an ideal condition, the adaptive beamforming technology can effectively suppress interference and retain desired signals, thereby maximizing the signal-to-interference-plus-noise ratio of the output signal of the array.
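
As one concrete illustration of this idea (a textbook minimum-variance distortionless response (MVDR) beamformer, not the application's specific algorithm), the weights $w = R^{-1}d / (d^{H}R^{-1}d)$ pass the steering direction $d$ undistorted while minimizing output power from other directions:

```python
import numpy as np

def mvdr_weights(R, d):
    """R: (M, M) spatial covariance of the microphone signals;
    d: (M,) steering vector of the desired direction.
    Returns the (M,) complex beamformer weights."""
    Rinv_d = np.linalg.solve(R, d)
    return Rinv_d / (d.conj() @ Rinv_d)

M = 6  # microphones in a half-wavelength-spaced uniform linear array
d = np.exp(1j * np.pi * np.arange(M) * np.sin(np.deg2rad(15)))  # steer to 15 deg
snap = np.random.randn(M, 1000) + 1j * np.random.randn(M, 1000)  # snapshots
R = snap @ snap.conj().T / 1000 + 1e-3 * np.eye(M)  # regularized covariance
y = mvdr_weights(R, d).conj() @ snap                # beamformed output, (1000,)
```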

2. The Blind Source Separation Algorithm

Blind source separation (BSS) means that a source signal is estimated only according to an observed mixed signal when the source signal and the signal mixing parameters are unknown. Independent component analysis (ICA) is a new technology gradually developed to resolve the problem of blind signal separation. The method of independent component analysis is mainly used to resolve blind signal separation, that is, a received mixed signal is decomposed into several independent components according to the principle of statistical independence by using an optimization algorithm, and such independent components are used as an approximate estimation of the source signals.
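
A standard ICA run is sketched below with scikit-learn's `FastICA` on a synthetic two-source mixture; the mixing setup is illustrative, not from the application:

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 8000)
s = np.c_[np.sin(2 * np.pi * 440 * t),         # source 1: a tone
          np.sign(np.sin(2 * np.pi * 3 * t))]  # source 2: a square wave
x = s @ rng.random((2, 2)).T                   # observed mixtures, (8000, 2)

ica = FastICA(n_components=2, random_state=0)
s_est = ica.fit_transform(x)  # estimated independent components, (8000, 2)
```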

3. The Deep Learning-Based Speech Separation Algorithm

Speech separation based on deep learning mainly adopts deep learning methods to learn features of voice, a speaker, and noise, thereby achieving the objective of speech separation. To be specific, a multi-layer perceptron, a deep neural network (DNN), a convolutional neural network (CNN), a long short-term memory (LSTM) network, a generative adversarial network (GAN), or the like may be used. This is not limited herein.
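
A common mask-based formulation is sketched below in PyTorch: an LSTM estimates a time-frequency mask in [0, 1] that is applied to the mixture magnitude spectrogram. The layer sizes are illustrative assumptions, not values from the application:

```python
import torch
import torch.nn as nn

class MaskNet(nn.Module):
    def __init__(self, n_freq: int = 257, hidden: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(n_freq, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, n_freq)

    def forward(self, mix_mag: torch.Tensor) -> torch.Tensor:
        """mix_mag: (batch, frames, n_freq) mixture magnitude spectrogram."""
        h, _ = self.lstm(mix_mag)
        mask = torch.sigmoid(self.proj(h))  # time-frequency mask in [0, 1]
        return mask * mix_mag               # masked (separated) magnitudes

net = MaskNet()
separated = net(torch.rand(1, 100, 257))    # output shape (1, 100, 257)
```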

When speech enhancement is performed by using the GAN, the generator is usually built from convolution layers, to reduce the number of training parameters and thereby shorten the training time. The discriminator is responsible for providing authenticity information of the generated data to the generator, and helps the generator adjust slightly toward “generating a clean sound”.

In addition, in this embodiment of this application, the manner of signal separation based on the control signal is provided. In the foregoing manner, the adaptive beamforming algorithm is also used during the signal separation when the control signal is generated by using the adaptive beamforming algorithm, the blind source separation algorithm is also used during the signal separation when the control signal is generated by using the blind source separation algorithm, and the deep learning-based speech separation algorithm is also used during the signal separation when the control signal is generated by using the deep learning-based speech separation algorithm. In this way, the control signal can better coordinate the separation of signals, to achieve a better signal separation effect, thereby improving the accuracy of speech detection.

Based on the embodiment corresponding to FIG. 3, in another exemplary embodiment provided in the embodiments of this application, the generating a speech detection result of the target detection sound area according to the speech output signal corresponding to the target detection sound area specifically includes the following steps:

determining a signal power corresponding to the target detection sound area according to the speech output signal corresponding to the target detection sound area, the signal power being a signal power of the speech output signal at a time-frequency point;

determining an estimated signal-to-noise ratio corresponding to the target detection sound area according to the signal power corresponding to the target detection sound area;

determining an output signal weighted value corresponding to the target detection sound area according to the estimated signal-to-noise ratio corresponding to the target detection sound area, the output signal weighted value being a weighted result of the speech output signal at the time-frequency point;

determining a target speech output signal corresponding to the target detection sound area according to the output signal weighted value corresponding to the target detection sound area and the speech output signal corresponding to the target detection sound area; and

determining the speech detection result corresponding to the target detection sound area according to the target speech output signal corresponding to the target detection sound area.

A manner of performing cross-channel post-processing on a speech output signal is described in this embodiment. Since a speech output signal after signal separation is not always clean, cross-channel post-processing may be performed when the speech output signal corresponding to each sound pointing angle has a relatively high signal-to-noise ratio. The signal-to-noise ratio is considered to be relatively high when the signal-to-noise ratio of the speech output signal is higher than −5 decibels. However, the critical value of the signal-to-noise ratio may further be adjusted according to an actual situation, and “−5 decibels” is merely an example and is not to be construed as a limitation on this application.

Each sound area is used as a target detection sound area, and an implementation of cross-channel post-processing includes: firstly, determining a signal power corresponding to the target detection sound area according to the speech output signal corresponding to the target detection sound area; then, calculating an estimated signal-to-noise ratio corresponding to the target detection sound area, and determining an output signal weighted value corresponding to the target detection sound area; and finally, determining a target speech output signal corresponding to the target detection sound area according to the output signal weighted value and the speech output signal corresponding to the target detection sound area, and determining a speech detection result corresponding to the target detection sound area based on the target speech output signal. Based on this, for ease of description, any sound area in the N sound areas is used as an example in the description below, and a target speech output signal in another sound area is also determined in a similar manner, which is not described herein. Any sound area may be used as a target detection sound area. Assuming that the sound area is an i-th sound area, a corresponding sound pointing angle is $\theta_i$, and for each time-frequency point $(t, f)$ at the sound pointing angle $\theta_i$, an estimated signal-to-noise ratio of the i-th sound area may be calculated by using the following formula:

$\mu_i(t,f) = \dfrac{P_i(t,f)}{\max\limits_{j = 1,\ldots,N;\ j \neq i} P_j(t,f)};$

where

$\mu_i(t, f)$ represents the estimated signal-to-noise ratio of the i-th sound area, $P_i(t, f)$ represents the signal power of the speech output signal in the direction with the sound pointing angle $\theta_i$ at the time-frequency point $(t, f)$, $N$ represents the N sound areas (which can also be regarded as N sound pointing angles), $j$ represents a j-th sound area (which can also be regarded as a j-th sound pointing angle), $i$ represents an i-th sound area (which can also be regarded as an i-th sound pointing angle), $t$ represents a time, and $f$ represents a frequency.

Next, an output signal weighted value of the i-th sound area is calculated below by using a formula of Wiener filtering:

$g_i(t,f) = \sqrt{\dfrac{\mu_i(t,f)}{\mu_i(t,f) + 1}};$

where

$g_i(t, f)$ represents the output signal weighted value of the i-th sound area, that is, the weight of the speech output signal in the direction with the sound pointing angle $\theta_i$ at the time-frequency point $(t, f)$.

Finally, based on the output signal weighted value of the i-th sound area and the speech output signal of the i-th sound area, a target speech output signal of the i-th sound area may be calculated by using the following formula:

y _(i)(t, f)=x _(i)(t, f)*g _(i)(t, f); where5

y_(i)(t, f) represents a target speech output signal of the i^(th) soundarea, that is, a target speech output signal calculated in the soundpointing angle θ_(i) by using a cross-channel post-processing algorithm.x_(i)(t, f) represents a speech output signal of the i^(th) sound area,that is, a speech output signal in the direction with the sound pointingangle θ_(i). It may be understood that, the target speech output signaly_(i)(t, f) in this embodiment is a speech output signal having not beennoise-reduced.
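Under the same assumptions, the last two formulas may be sketched together as follows (the complex STFT x_i of the speech output signal is assumed as input; this is an illustrative restatement of the formulas above, not a prescribed implementation):

    import numpy as np

    def cross_channel_output(x_i, mu_i):
        # g_i = sqrt(mu_i / (mu_i + 1)): Wiener-style weight of the speech
        # output signal at each time-frequency point.
        g_i = np.sqrt(mu_i / (mu_i + 1.0))
        # y_i = x_i * g_i: target speech output signal of the i-th sound area.
        return x_i * g_i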

In addition, in this embodiment of this application, the manner of performing cross-channel post-processing on a speech output signal is provided. In the foregoing manner, considering a correlation between different sound areas, speech signals may be better separated by performing cross-channel post-processing, and especially when the signal-to-noise ratio is high enough, the purity of the speech signal may be improved, thereby further improving the quality of the output signal.

Based on the embodiment corresponding to FIG. 3, in another exemplary embodiment provided in the embodiments of this application, the determining a target speech output signal corresponding to the target detection sound area according to the output signal weighted value corresponding to the target detection sound area and the speech output signal corresponding to the target detection sound area specifically includes the following steps:

determining a to-be-processed speech output signal corresponding to the target detection sound area according to the output signal weighted value corresponding to the target detection sound area and the speech output signal corresponding to the target detection sound area; and

performing noise reduction on the to-be-processed speech output signal corresponding to the target detection sound area, to obtain the target speech output signal corresponding to the target detection sound area.

A manner of performing noise reduction on a to-be-processed speech output signal is described in this embodiment. For ease of description, any sound area in the N sound areas is used as an example for description below, and a target speech output signal in another sound area is also determined in a similar manner, which is not described herein again. Any sound area may be used as a target detection sound area. Assuming that the sound area is an i^(th) sound area, a corresponding sound pointing angle is θ_(i). As can be seen from the foregoing embodiments, the target speech output signal of the i^(th) sound area may be calculated according to the output signal weighted value of the i^(th) sound area and the speech output signal of the i^(th) sound area. However, when noise reduction is required, based on the output signal weighted value of the i^(th) sound area and the speech output signal of the i^(th) sound area, a to-be-processed speech output signal of the i^(th) sound area may be calculated by using the following formula:

y′_(i)(t, f)=x_(i)(t, f)*g_(i)(t, f); where

y′_(i)(t, f) represents a to-be-processed speech output signal of the i^(th) sound area, that is, a to-be-processed speech output signal calculated in the sound pointing angle θ_(i) by using a cross-channel post-processing algorithm. x_(i)(t, f) represents a speech output signal of the i^(th) sound area, that is, a speech output signal in the direction with the sound pointing angle θ_(i). It may be understood that, different from the foregoing embodiments, the to-be-processed speech output signal y′_(i)(t, f) in this embodiment is a speech output signal that has not been noise-reduced, while the target speech output signal y_(i)(t, f) in this embodiment is a speech output signal after noise reduction.

In view of this, noise reduction is performed on the to-be-processed speech output signal y′_(i)(t, f), to obtain a target speech output signal y_(i)(t, f) corresponding to each sound area.

A feasible filtering manner is that noise reduction is performed by using a least mean square (LMS) adaptive filter, where the LMS adaptive filter automatically adjusts a current filter parameter by using a filter parameter obtained at a previous moment, to adapt to unknown or randomly changing statistical characteristics of the signal and noise, thereby achieving optimal filtering. Another feasible filtering manner is that noise reduction is performed by using an LMS adaptive notch filter, where the adaptive notch filter is adapted to monochromatic interference noise, for example, single-frequency sine wave noise; ideally, the notch filter has a notch whose shoulder is arbitrarily narrow, so that the frequency response immediately enters a flat region outside the notch. Another feasible filtering manner is that noise reduction is performed by using a basic spectral subtraction algorithm: since the to-be-processed speech output signal is not sensitive to a phase, phase information before spectral subtraction is reused in a signal after spectral subtraction, and after an amplitude after spectral subtraction is calculated, a target speech output signal after spectral subtraction can be calculated by performing inverse fast Fourier transform (IFFT) with reference to a phase angle. Another feasible filtering manner is that noise reduction is performed through Wiener filtering. The foregoing examples are merely feasible solutions, and another noise reduction manner may also be adopted in actual application. This is not limited herein.
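As an illustration of the basic spectral subtraction manner mentioned above, a minimal sketch is given below (the over-subtraction factor, the spectral floor, and the externally supplied noise magnitude estimate are assumptions chosen for the example):

    import numpy as np

    def spectral_subtraction(stft, noise_mag, alpha=1.0, floor=0.02):
        # stft: complex spectrum of the to-be-processed speech output
        # signal, shape (T, F); noise_mag: estimated noise magnitude, (F,).
        mag = np.abs(stft)
        phase = np.angle(stft)                 # phase before subtraction
        # Subtract the noise magnitude and keep a small spectral floor.
        clean_mag = np.maximum(mag - alpha * noise_mag, floor * mag)
        # Reuse the original phase; an IFFT of this spectrum yields the
        # time-domain target speech output signal.
        return clean_mag * np.exp(1j * phase)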

In addition, in this embodiment of this application, the manner of performing noise reduction on a to-be-processed speech output signal is described. In the foregoing manner, noise, interfering human voice, and residual echo can be further suppressed, thereby better improving the quality of the target speech output signal and increasing the accuracy of speech detection.

Based on the embodiment corresponding to FIG. 3, in another exemplary embodiment provided in the embodiments of this application, the generating a speech detection result of the target detection sound area according to the speech output signal corresponding to the target detection sound area specifically includes the following steps:

generating a first speech detection result when the target speech output signal corresponding to the target detection sound area meets a human voice matching condition, the first speech detection result belonging to the speech detection result, and the first speech detection result indicating that the target speech output signal is a human voice signal; and

generating a second speech detection result when the target speech output signal corresponding to the target detection sound area does not meet the human voice matching condition, the second speech detection result belonging to the speech detection result, and the second speech detection result indicating that the target speech output signal is a noise signal.

A manner of performing speech detection on each sound area is described in this embodiment. During the speech detection, whether a speech output signal corresponding to each sound area meets a human voice matching condition needs to be determined. The “target speech output signal” in this embodiment is obtained by performing cross-channel post-processing and noise reduction post-processing on the speech output signal. Speech detection may be performed on the “speech output signal” when the speech output signal has been subjected to neither cross-channel post-processing nor noise reduction post-processing, and speech detection may be performed on the “to-be-processed speech output signal” when the speech output signal has only been subjected to cross-channel post-processing without being subjected to noise reduction post-processing. The “target speech output signal” is used as an example to describe this application, which shall not be construed as a limitation on this application.

How to determine whether the human voice matching condition is met based on the target speech output signal is described below. For ease of description, any sound area in the N sound areas is used as an example for description below, and a speech detection result in another sound area is also determined in a similar manner, which is not described herein again. Any sound area may be used as a target detection sound area. During the detection, whether a sound area meets the human voice matching condition may be determined according to any one of the target speech output signal, the lip motion information, the user information, or the voiceprint, and the description is made below with reference to several examples.

First case: when the target speech output signal is not received, that is, the user does not speak, it is determined that the human voice matching condition is not met.

Second case: when the received target speech output signal is very weak or does not sound like human voice, it can be determined that the user does not speak in a sound pointing angle direction corresponding to a sound area in this case, and the human voice matching condition is not met.

Third case: when the received target speech output signal is human voice that is extremely mismatched (for example, a matching score is less than 0.5) with a voiceprint of given user information, it may be determined that the user does not speak in a sound pointing angle direction corresponding to a sound area in this case, the target speech output signal is a noise signal leaked from human voice in other directions into a local sound channel, and the human voice matching condition is not met.

Fourth case: when the received target speech output signal is human voice, but the lip motion information indicates that the user does not have lip motion and a degree of voiceprint matching is not high, it may also be determined that the user does not speak in a sound pointing angle direction corresponding to a sound area in this case, the target speech output signal is a noise signal leaked from human voice in other directions into a local sound channel, and the human voice matching condition is not met.

A corresponding voiceprint may be obtained from a database based on the user information (assuming that the user has registered with the user information), and whether a target speech output signal in a current channel matches the voiceprint of the user may be determined according to the voiceprint. When the matching succeeds, it is determined that the human voice matching condition is met; and when the matching fails, it may be determined that the target speech output signal is a noise signal leaked from human voice in other directions into a local sound channel, that is, the human voice matching condition is not met.

The foregoing four cases are merely examples, and in actual application, another determining manner may also be flexibly set according to situations. This is not limited herein. If it is determined that the target speech output signal meets the human voice matching condition, a first speech detection result is generated, indicating that the target speech output signal is a normal human voice signal; on the contrary, if it is determined that the target speech output signal does not meet the human voice matching condition, a second speech detection result is generated, indicating that the target speech output signal is a noise signal.
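Purely as an illustrative summary of the four cases above, the determination might be sketched as follows (the energy floor and the 0.7 "not high" threshold are assumptions of the example; only the 0.5 score appears in the text above):

    def meets_human_voice_condition(energy, is_human_voice, voiceprint_score,
                                    has_lip_motion, energy_floor=1e-6):
        # First and second cases: no signal received, or the signal is very
        # weak or does not sound like human voice.
        if energy < energy_floor or not is_human_voice:
            return False
        # Third case: extreme mismatch with the registered voiceprint.
        if voiceprint_score is not None and voiceprint_score < 0.5:
            return False
        # Fourth case: human voice, but no lip motion and only a modest
        # voiceprint match, treated as leakage from another direction.
        if has_lip_motion is False and (voiceprint_score is None
                                        or voiceprint_score < 0.7):
            return False
        return True  # human voice matching condition is met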

In addition, in this embodiment of this application, the manner of performing speech detection on each sound area is provided. In the foregoing manner, whether the human voice matching condition is met needs to be determined for each sound area, and the human voice matching condition is considered not to be met when, even though a user exists in some sound areas, the user does not speak or speaks in a very low voice, or identity information of the user does not match preset identity information. Therefore, to improve the accuracy of speech detection, whether the speech output signal corresponding to the sound area meets the human voice matching condition may be determined from a plurality of dimensions, thereby improving the feasibility and operability of the solution.

Based on the embodiment corresponding to FIG. 3, in another exemplary embodiment provided in the embodiments of this application, after the generating a speech detection result of the target detection sound area according to the speech output signal corresponding to the target detection sound area, the method may further include the following steps:

determining a target sound area from the M sound areas according to a speech output signal corresponding to each sound area in the M sound areas when speech detection results corresponding to M sound areas are first speech detection results, the first speech detection result indicating that the speech output signal is the human voice signal, the M sound areas belonging to the N sound areas, M being an integer greater than or equal to 1 and less than or equal to N; and

transmitting a speech output signal corresponding to the target sound area to a calling party.

A manner of calling based on a speech detection result is described in this embodiment. As can be seen from the foregoing embodiments, a sound area corresponding to the first speech detection result is selected after a speech detection result corresponding to each sound area in the N sound areas is obtained. This is because in a call scenario, it is necessary to transmit human voice and suppress noise to improve the call quality, where the first speech detection result indicates that the speech output signal of the sound area is a human voice signal. It is to be understood that the “speech output signal” in this embodiment may also be the “to-be-processed speech output signal” or the “target speech output signal”, which can be flexibly selected in a specific processing process. The description herein is merely an example, and shall not be construed as a limitation on this application.

Assuming that speech detection results of M sound areas in the N sound areas are first speech detection results, that is, a speech output signal (or a target speech output signal or a to-be-processed speech output signal) corresponding to each sound area in the M sound areas is a human voice signal. Based on this, a main speaker may further be determined based on speech output signals of the M sound areas, where each sound area in the M sound areas is referred to as a “target sound area”. For ease of description, referring to FIG. 6, FIG. 6 is another schematic architectural diagram of a multi-channel sound pickup system according to an embodiment of this application. As shown in the figure, the microphone array equipped on the terminal device may pick up an audio signal corresponding to each sound area, where the audio signal includes a speech input signal and a noise signal. A control signal corresponding to each sound area is generated by a signal separator, and suppression or retention is performed on a speech input signal of each sound pointing angle by using the control signal corresponding to the sound area, to obtain a speech output signal corresponding to the sound area. A speech detection result of the sound area is determined based on sound area information and the target speech output signal of the sound area.

A main speaker judging module determines a main speaker in real time according to speech output signals and sound area information of the M sound areas. For example, when the delay requirement on the judgment result is high, the main speaker judging module may directly measure an original volume of a speaker (a volume at the mouth) according to a signal strength of each speaker received in a short time and the distance (which may be provided by a wide-angle camera or a multi-camera array) between the speaker and the microphone array, so that the main speaker is determined according to the original volume. In another example, when the delay requirement on the judgment result is lower, the main speaker may be determined according to a facial orientation of each speaker (for example, in a video conference scenario, a user whose face is facing the camera is more likely to be the main speaker). The judgment result of the main speaker includes an orientation and an identity of the main speaker, and the judgment result is outputted to a mixer for a call demand. The mixer merges N continuous audio streams into one or more channels of output audio according to the judgment result of the main speaker, to meet call requirements. In an implementation, when the main speaker is determined to be in a direction with a sound pointing angle θ_(1), the outputted single-channel audio is equal to the speech output signal inputted in the first channel, and input data of the other channels is directly discarded. In another implementation, when main speakers are determined to be in a direction with a sound pointing angle θ_(1) and in a direction with a sound pointing angle θ_(4), the outputted audio is a mix of the speech output signal inputted in the first channel and the speech output signal inputted in the fourth channel, and input data of the other channels is directly discarded.
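A minimal sketch of the mixing step described above (the channel indexing and the equal-weight summation are assumptions made for the example):

    import numpy as np

    def mix_main_speakers(channel_audio, main_speaker_channels):
        # channel_audio: array of shape (N, num_samples), one stream per
        # sound area; main_speaker_channels: e.g. [0] for the first channel
        # or [0, 3] for the first and fourth channels.
        selected = channel_audio[main_speaker_channels]
        # Input data of every other channel is simply discarded.
        return selected.sum(axis=0)  # merged output audio for the call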

As can be seen from FIG. 6, cross-channel post-processing and noise reduction post-processing may further be performed on the speech output signal, to obtain a target speech output signal corresponding to each sound area, and a speech detection result of the sound area is determined based on the sound area information and the target speech output signal of the sound area.

Referring to FIG. 7, FIG. 7 is a schematic diagram of an interface of implementing calling by using a multi-sound area-based speech detection method according to an embodiment of this application. As shown in the figure, using an example in which the method is applied to a call scenario, there are a plurality of users among the participants. Therefore, a main speaker may be determined by using the technical solutions provided in this application, voice of the main speaker is transmitted to a user A, while voice of another speaker or noise may be suppressed, so that the user A can hear clearer voice.

In this embodiment of this application, the manner of calling based on a speech detection result is provided. In the foregoing manner, speech of each user can be separated and enhanced in real time in a multi-user scenario, so that a high-quality call can be achieved in the call scenario according to the speech detection result and based on processes of multi-user parallel separation and enhancement processing and post-mixing processing.

Based on the embodiment corresponding to FIG. 3, in another exemplary embodiment provided in the embodiments of this application, after the generating a speech detection result of the target detection sound area according to the speech output signal corresponding to the target detection sound area, the method may further include the following steps:

determining a target sound area from the M sound areas according to a speech output signal corresponding to each sound area in the M sound areas when speech detection results corresponding to M sound areas are first speech detection results, the first speech detection result indicating that the speech output signal is the human voice signal, the M sound areas belonging to the N sound areas, M being an integer greater than or equal to 1 and less than or equal to N;

performing semantic recognition on a speech output signal corresponding to the target sound area, to obtain a semantic recognition result; and

generating dialog response information according to the semantic recognition result.

A manner of feeding back dialog response information based on a speech detection result is provided in this embodiment. As can be seen from the foregoing embodiments, a sound area corresponding to the first speech detection result is selected after a speech detection result corresponding to each sound area in the N sound areas is obtained. This is because in an intelligent dialog scenario, it is necessary to transmit human voice and suppress noise to improve the accuracy of the intelligent dialog, where the first speech detection result indicates that the speech output signal of the sound area is a human voice signal. It is to be understood that the “speech output signal” in this embodiment may also be the “to-be-processed speech output signal” or the “target speech output signal”, which can be flexibly selected in a specific processing process. The description herein is merely an example, and shall not be construed as a limitation on this application.

Assuming that speech detection results of M sound areas in the N sound areas are first speech detection results, that is, a speech output signal (or a target speech output signal or a to-be-processed speech output signal) corresponding to each sound area in the M sound areas is a human voice signal. Based on this, a main speaker may further be determined based on speech output signals of the M sound areas, where each sound area in the M sound areas is referred to as a “target sound area”. For ease of description, referring to FIG. 8, FIG. 8 is another schematic architectural diagram of a multi-channel sound pickup system according to an embodiment of this application. As shown in the figure, the microphone array equipped on the terminal device may pick up an audio signal corresponding to each sound area, where the audio signal includes a speech input signal and a noise signal. A control signal corresponding to each sound area is generated by a signal separator, and suppression or retention is performed on a speech input signal of each sound pointing angle by using the control signal corresponding to the sound area, to obtain a speech output signal corresponding to the sound area. A speech detection result of the sound area is determined based on sound area information and the target speech output signal of the sound area.

Then, natural language processing (NLP) is performed on a speech output signal corresponding to each target sound area in the M sound areas, so that an intention of a speaker in the target sound area is obtained, that is, a semantic recognition result is obtained.

A main speaker judging module determines a main speaker in real time according to speech output signals and sound area information of the M sound areas. For example, when the delay requirement on the judgment result is high, the main speaker judging module may directly measure an original volume of a speaker (a volume at the mouth) according to a signal strength of each speaker received in a short time and the distance (which may be provided by a wide-angle camera or a multi-camera array) between the speaker and the microphone array, so that the main speaker is determined according to the original volume. In another example, when the delay requirement on the judgment result is lower, the main speaker may be determined according to a semantic recognition result and a facial orientation of each speaker (for example, in a video conference scenario, a user whose face is facing the camera is more likely to be the main speaker). The judgment result of the main speaker includes an orientation and an identity of the main speaker, and the judgment result is used as a basis for generating dialog response information, so as to reply with dialog response information corresponding to the intention of the main speaker.

As can be seen from FIG. 8, cross-channel post-processing and noise reduction post-processing may further be performed on the speech output signal, to obtain a target speech output signal corresponding to each sound area, and a speech detection result of the sound area is determined based on the sound area information and the target speech output signal of the sound area.

Referring to FIG. 9, FIG. 9 is a schematic diagram of an interface of implementing dialog responding by using a multi-sound area-based speech detection method according to an embodiment of this application. As shown in the figure, using an example in which the method is applied to an intelligent dialog scenario, assuming that there are a plurality of speakers in the conversation, a main speaker may be determined by using the technical solutions provided in this application, and according to the judgment result and the semantic recognition result of the main speaker, a reply is made to “Xiaoteng, what day is it today?” said by the main speaker, that is, dialog response information is generated, for example, “Hi, today is Friday”.

In actual application, the method may further be applied to scenarios such as intelligent customer service and human-machine dialog, so that synchronous, real-time, and independent semantic analysis may be performed on each speaker, and functions such as manual or automatic blocking or enabling may be performed on each speaker.

In this embodiment of this application, the manner of feeding back dialog response information based on a speech detection result is provided. In the foregoing manner, speech of each user can be separated and enhanced in real time in a multi-user scenario, a main speaker is determined according to the speech detection result and the semantic recognition result in the intelligent dialog scenario, and the speech quality is improved based on processes of multi-user parallel separation and enhancement processing and post-mixing processing, so that dialog response information can be separately fed back according to the semantic recognition result, and non-interactive speech may be filtered out.

Based on the embodiment corresponding to FIG. 3, in another exemplary embodiment provided in the embodiments of this application, after the generating a speech detection result of the target detection sound area according to the speech output signal corresponding to the target detection sound area, the method may further include the following steps:

determining a target sound area from the M sound areas according to a speech output signal corresponding to each sound area in the M sound areas when speech detection results corresponding to M sound areas are first speech detection results, the first speech detection result indicating that the speech output signal is the human voice signal, the M sound areas belonging to the N sound areas, M being an integer greater than or equal to 1 and less than or equal to N;

performing segmentation processing on a speech output signal corresponding to the target sound area, to obtain to-be-recognized audio data;

performing speech recognition on the to-be-recognized audio data corresponding to the target sound area, to obtain a speech recognition result; and

generating text record information according to a speech recognition result corresponding to the target sound area, the text record information including at least one of translation text or conference record text.

A manner of generating text record information based on a speech detection result is provided in this embodiment. As can be seen from the foregoing embodiments, a sound area corresponding to the first speech detection result is selected after a speech detection result corresponding to each sound area in the N sound areas is obtained. This is because in a translation or recording scenario, it is necessary to transmit human voice and suppress noise to improve the accuracy of the translation or the record, where the first speech detection result indicates that the speech output signal of the sound area is a human voice signal. It is to be understood that the “speech output signal” in this embodiment may also be the “to-be-processed speech output signal” or the “target speech output signal”, which can be flexibly selected in a specific processing process. The description herein is merely an example, and shall not be construed as a limitation on this application.

Assuming that speech detection results of M sound areas in the N sound areas are first speech detection results, that is, a speech output signal (or a target speech output signal or a to-be-processed speech output signal) corresponding to each sound area in the M sound areas is a human voice signal. Based on this, a main speaker may further be determined based on speech output signals of the M sound areas, where each sound area in the M sound areas is referred to as a “target sound area”. For ease of description, referring to FIG. 10, FIG. 10 is another schematic architectural diagram of a multi-channel sound pickup system according to an embodiment of this application. As shown in the figure, the microphone array equipped on the terminal device may pick up an audio signal corresponding to each sound area, where the audio signal includes a speech input signal and a noise signal. A control signal corresponding to each sound area is generated by a signal separator, and suppression or retention is performed on a speech input signal of each sound pointing angle by using the control signal corresponding to the sound area, to obtain a speech output signal corresponding to the sound area. A speech detection result of the sound area is determined based on sound area information and the target speech output signal of the sound area.

Then, a speech output signal corresponding to each target sound area in the M sound areas is segmented, that is, a stop position of each speech output signal is determined, to obtain to-be-recognized audio data. In addition, each piece of to-be-recognized audio data carries user information, where the user information may be a user identifier. The to-be-recognized audio data and the user information are both used for subsequent speech recognition tasks. Then, to-be-recognized audio data corresponding to each target sound area in the M sound areas is processed by using the ASR technology, so that speech content of a speaker in the target sound area is obtained, that is, a speech recognition result is obtained.
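One possible sketch of the segmentation step (a simple frame-energy endpointer; the frame length and threshold are assumptions, and an actual system may instead derive stop positions from the speech detection results):

    import numpy as np

    def segment_speech(samples, rate, frame_ms=20, threshold=1e-4):
        # Split a speech output signal into to-be-recognized audio segments
        # by finding where frame energy falls below a threshold; returns a
        # list of (start_sample, end_sample) pairs.
        frame = int(rate * frame_ms / 1000)
        n = len(samples) // frame
        active = [np.mean(samples[k * frame:(k + 1) * frame] ** 2) > threshold
                  for k in range(n)]
        segments, start = [], None
        for k, is_active in enumerate(active):
            if is_active and start is None:
                start = k * frame                    # speech onset
            elif not is_active and start is not None:
                segments.append((start, k * frame))  # stop position
                start = None
        if start is not None:
            segments.append((start, n * frame))
        return segments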

A main speaker judging module determines a main speaker in real time according to speech output signals and sound area information of the M sound areas. For example, when the delay requirement on the judgment result is high, the main speaker judging module may directly measure an original volume of a speaker (a volume at the mouth) according to a signal strength of each speaker received in a short time and the distance (which may be provided by a wide-angle camera or a multi-camera array) between the speaker and the microphone array, so that the main speaker is determined according to the original volume. In another example, when the delay requirement on the judgment result is lower, the main speaker may be determined according to a speech recognition result and a facial orientation of each speaker (for example, in a video conference scenario, a user whose face is facing the camera is more likely to be the main speaker). The judgment result of the main speaker includes an orientation and an identity of the main speaker, and the judgment result is used as a basis for generating text record information, so that the text record information is displayed accordingly, where the text record information includes at least one of translation text or conference record text.

It may be understood that, by using the ASR technology, the segmented to-be-recognized audio data may be transmitted together with a voiceprint to an ASR module in a cloud in a regular manner or in a form of a machine learning model. Generally, a voiceprint identifier or a voiceprint model parameter is sent to the ASR module in the cloud, so that the ASR module may further improve a recognition rate of the module by using voiceprint information.

As can be seen from FIG. 10, cross-channel post-processing and noise reduction post-processing may further be performed on the speech output signal, to obtain a target speech output signal corresponding to each sound area, and a speech detection result of the sound area is determined based on the sound area information and the target speech output signal of the sound area. In addition, an object of a speech signal to be segmented is a target speech output signal corresponding to each target sound area.

Referring to FIG. 11, FIG. 11 is a schematic diagram of implementing text recording by using a multi-sound area-based speech detection method according to an embodiment of this application. As shown in the figure, using an example in which the method is applied to a simultaneous interpretation scenario, assuming that there are a plurality of speakers in the conversation, a main speaker may be determined by using the technical solutions provided in this application, and a paragraph said by the main speaker may be interpreted in real time according to a judgment result and a speech recognition result of the main speaker. For example, the main speaker is a user A, and the user A speaks a sentence in Chinese. In this case, the text record information may be displayed in real time, for example, “The main content of this meeting is to let everyone have a better understanding of this year's work objectives and improve work efficiency”.

In actual application, the method may further be applied to scenarios such as translation, conference recording, and conference assistance, so that synchronous, real-time, and independent speech recognition (for example, complete conference transcription) may be performed on each speaker, and functions such as manual or automatic blocking or enabling may be performed on each speaker.

In this embodiment of this application, the manner of generating text record information based on a speech detection result is provided. In the foregoing manner, speech of each user can be separated and enhanced in real time in a multi-user scenario, so that starting and ending time points of each speaker's speech may be accurately distinguished according to the speech detection result and the speech recognition result; speech of each speaker is separately recognized, to achieve more accurate speech recognition performance for subsequent semantic understanding and translation; and the speech quality is improved based on processes of multi-user parallel separation and enhancement processing and post-mixing processing, thereby improving the accuracy of the text record information.

The speech detection apparatus provided in this application is described in detail below. Referring to FIG. 12, FIG. 12 is a schematic diagram of an embodiment of a speech detection apparatus according to an embodiment of this application. The speech detection apparatus 20 includes:

an obtaining module 201, configured to obtain sound area information corresponding to each sound area in N sound areas, the sound area information including a sound area identifier, a sound pointing angle, and user information, the sound area identifier being used for identifying a sound area, the sound pointing angle being used for indicating a central angle of the sound area, the user information being used for indicating a user existence situation in the sound area, N being an integer greater than 1;

a generation module 202, configured to use the sound area as a target detection sound area, and generate a control signal corresponding to the target detection sound area according to sound area information corresponding to the target detection sound area, the control signal being used for performing suppression or retention on a speech input signal, the control signal and the sound area being in a one-to-one correspondence; and

a processing module 203, configured to process a speech input signal corresponding to the target detection sound area by using the control signal corresponding to the target detection sound area, to obtain a speech output signal corresponding to the target detection sound area, the control signal, the speech input signal, and the speech output signal being in a one-to-one correspondence;

the generation module 202 being further configured to generate a speech detection result of the target detection sound area according to the speech output signal corresponding to the target detection sound area.

Based on the embodiment corresponding to FIG. 12, in another embodiment of the speech detection apparatus 20 provided in this embodiment of this application:

the obtaining module 201 is further configured to: detect the sound area in the N sound areas, to obtain a user detection result corresponding to the sound area;

use the sound area as the target detection sound area, and determine user information corresponding to the target detection sound area according to a user detection result corresponding to the target detection sound area;

determine lip motion information corresponding to the target detection sound area according to the user detection result corresponding to the target detection sound area;

obtain a sound area identifier corresponding to the target detection sound area and a sound pointing angle corresponding to the target detection sound area; and

generate the sound area information corresponding to the target detection sound area according to the user information corresponding to the target detection sound area, the lip motion information corresponding to the target detection sound area, the sound area identifier corresponding to the target detection sound area, and the sound pointing angle corresponding to the target detection sound area.

Based on the embodiment corresponding to FIG. 12, in another embodiment of the speech detection apparatus 20 provided in this embodiment of this application:

the obtaining module 201 is further configured to: determine a first identity as the user information when the user detection result corresponding to the target detection sound area is that a recognizable user exists in the target detection sound area;

determine a second identity as the user information when the user detection result corresponding to the target detection sound area is that no user exists in the target detection sound area; and

determine a third identity as the user information when the user detection result corresponding to the target detection sound area is that an unknown user exists in the target detection sound area; and

the obtaining module 201 is further configured to: determine a first motion identifier as the lip motion information when the user detection result corresponding to the target detection sound area is that a user with lip motion exists in the target detection sound area;

determine a second motion identifier as the lip motion information when the user detection result corresponding to the target detection sound area is that a user exists in the target detection sound area and the user does not have lip motion; and

determine a third motion identifier as the lip motion information when the user detection result corresponding to the target detection sound area is that no user exists in the target detection sound area.

Based on the embodiment corresponding to FIG. 12, in another embodiment of the speech detection apparatus 20 provided in this embodiment of this application:

the generation module 202 is further configured to: generate a first control signal when user information corresponding to the target detection sound area is used for indicating that no user exists in the target detection sound area, the first control signal belonging to the control signal, and the first control signal being used for performing suppression on the speech input signal; and

generate a second control signal when the user information corresponding to the target detection sound area is used for indicating that a user exists in the target detection sound area, the second control signal belonging to the control signal, and the second control signal being used for performing retention on the speech input signal.

Based on the embodiment corresponding to FIG. 12, in another embodiment of the speech detection apparatus 20 provided in this embodiment of this application:

the generation module 202 is further configured to: generate a first control signal when user information corresponding to the target detection sound area is used for indicating that no user exists in the target detection sound area, the first control signal belonging to the control signal, and the first control signal being used for performing suppression on the speech input signal;

generate the first control signal when the user information corresponding to the target detection sound area is used for indicating that a user exists in the target detection sound area and the user does not have lip motion;

generate a second control signal when the user information corresponding to the target detection sound area is used for indicating that a user exists in the target detection sound area and the user has lip motion, the second control signal belonging to the control signal, and the second control signal being used for performing retention on the speech input signal; and

generate the first control signal or the second control signal according to an original audio signal when the user information corresponding to the target detection sound area is used for indicating that a user exists in the target detection sound area and a lip motion situation of the user is unknown.
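For illustration, the selection logic applied by the generation module 202 might be summarized as follows (the returned string markers are placeholders of the example, not identifiers defined by this application):

    def select_control_signal(user_exists, lip_motion):
        # lip_motion: True (lip motion), False (no lip motion), None (unknown).
        if not user_exists:
            return "first"    # suppress the speech input signal
        if lip_motion is True:
            return "second"   # retain the speech input signal
        if lip_motion is False:
            return "first"    # user present but no lip motion
        # Lip motion situation unknown: decide from the original audio signal.
        return "decide_from_original_audio"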

Based on the embodiment corresponding to FIG. 12, in another embodiment of the speech detection apparatus 20 provided in this embodiment of this application:

the generation module 202 is further configured to generate the control signal corresponding to the target detection sound area according to sound area information corresponding to the target detection sound area by using a preset algorithm, the preset algorithm being an adaptive beamforming algorithm, a blind source separation algorithm, or a deep learning-based speech separation algorithm; and

the processing module 203 is further configured to: process, when the preset algorithm is the adaptive beamforming algorithm, a speech input signal corresponding to the target detection sound area according to the control signal corresponding to the target detection sound area by using the adaptive beamforming algorithm, to obtain the speech output signal corresponding to the target detection sound area;

process, when the preset algorithm is the blind source separation algorithm, the speech input signal corresponding to the target detection sound area according to the control signal corresponding to the target detection sound area by using the blind source separation algorithm, to obtain the speech output signal corresponding to the target detection sound area; and

process, when the preset algorithm is the deep learning-based speech separation algorithm, the speech input signal corresponding to the target detection sound area according to the control signal corresponding to the target detection sound area by using the deep learning-based speech separation algorithm, to obtain the speech output signal corresponding to the target detection sound area.

Based on the embodiment corresponding to FIG. 12, in another embodiment of the speech detection apparatus 20 provided in this embodiment of this application:

the generation module 202 is further configured to: determine a signal power corresponding to the target detection sound area according to the speech output signal corresponding to the target detection sound area, the signal power being a signal power of the speech output signal at a time-frequency point;

determine an estimated signal-to-noise ratio corresponding to the target detection sound area according to the signal power corresponding to the target detection sound area;

determine an output signal weighted value corresponding to the target detection sound area according to the estimated signal-to-noise ratio corresponding to the target detection sound area, the output signal weighted value being a weighted result of the speech output signal at the time-frequency point;

determine a target speech output signal corresponding to the target detection sound area according to the output signal weighted value corresponding to the target detection sound area and the speech output signal corresponding to the target detection sound area; and

determine the speech detection result corresponding to the target detection sound area according to the target speech output signal corresponding to the target detection sound area.

Based on the embodiment corresponding to FIG. 12, in another embodiment of the speech detection apparatus 20 provided in this embodiment of this application:

the generation module 202 is further configured to: determine a to-be-processed speech output signal corresponding to the target detection sound area according to the output signal weighted value corresponding to the target detection sound area and the speech output signal corresponding to the target detection sound area; and

perform noise reduction on the to-be-processed speech output signal corresponding to the target detection sound area, to obtain the target speech output signal corresponding to the target detection sound area.

Based on the embodiment corresponding to FIG. 12, in another embodiment of the speech detection apparatus 20 provided in this embodiment of this application:

the generation module 202 is further configured to: generate a first speech detection result when the target speech output signal corresponding to the target detection sound area meets a human voice matching condition, the first speech detection result belonging to the speech detection result, and the first speech detection result indicating that the target speech output signal is a human voice signal; and

generate a second speech detection result when the target speech output signal corresponding to the target detection sound area does not meet the human voice matching condition, the second speech detection result belonging to the speech detection result, and the second speech detection result indicating that the target speech output signal is a noise signal.

Based on the embodiment corresponding to FIG. 12, in another embodiment of the speech detection apparatus 20 provided in the embodiments of this application, the speech detection apparatus 20 further includes a determining module 204 and a transmission module 205, where

the determining module 204 is configured to: after the generating, by the generation module 202, a speech detection result of the target detection sound area according to the speech output signal corresponding to the target detection sound area, determine a target sound area from the M sound areas according to a speech output signal corresponding to each sound area in the M sound areas when speech detection results corresponding to M sound areas are first speech detection results, the first speech detection result indicating that the speech output signal is the human voice signal, the M sound areas belonging to the N sound areas, M being an integer greater than or equal to 1 and less than or equal to N; and

the transmission module 205 is configured to transmit a speech output signal corresponding to the target sound area to a calling party.

Based on the embodiment corresponding to FIG. 12, in another embodiment of the speech detection apparatus 20 provided in the embodiments of this application, the speech detection apparatus 20 further includes a determining module 204 and a recognition module 206, where

the determining module 204 is configured to: after the generating, by the generation module 202, a speech detection result of the target detection sound area according to the speech output signal corresponding to the target detection sound area, determine a target sound area from the M sound areas according to a speech output signal corresponding to each sound area in the M sound areas when speech detection results corresponding to M sound areas are first speech detection results, the first speech detection result indicating that the speech output signal is the human voice signal, the M sound areas belonging to the N sound areas, M being an integer greater than or equal to 1 and less than or equal to N;

the recognition module 206 is configured to perform semantic recognition on a speech output signal corresponding to the target sound area, to obtain a semantic recognition result; and

the generation module 202 is further configured to generate dialog response information according to the semantic recognition result.

Based on the embodiment corresponding to FIG. 12, in another embodiment of the speech detection apparatus 20 provided in the embodiments of this application, the speech detection apparatus 20 further includes a determining module 204 and a recognition module 206, where

the determining module 204 is configured to: after the generating, by the generation module 202, a speech detection result of the target detection sound area according to the speech output signal corresponding to the target detection sound area, determine a target sound area from the M sound areas according to a speech output signal corresponding to each sound area in the M sound areas when speech detection results corresponding to M sound areas are first speech detection results, the first speech detection result indicating that the speech output signal is the human voice signal, the M sound areas belonging to the N sound areas, M being an integer greater than or equal to 1 and less than or equal to N;

the processing module 203 is configured to perform segmentation processing on a speech output signal corresponding to the target sound area, to obtain to-be-recognized audio data;

the recognition module 206 is configured to perform speech recognition on the to-be-recognized audio data corresponding to the target sound area, to obtain a speech recognition result; and

the generation module 202 is further configured to generate text record information according to a speech recognition result corresponding to the target sound area, the text record information including at least one of translation text or conference record text.

FIG. 13 is a schematic structural diagram of a computer device 30 according to an embodiment of this application. The computer device 30 may include an input device 310, an output device 320, a processor 330, and a memory 340. The output device in this embodiment of this application may be a display device. The memory 340 may include a ROM and a RAM, and provides instructions and data to the processor 330. A part of the memory 340 may further include a non-volatile random access memory (NVRAM).

The memory 340 stores the following elements, executable modules or data structures, or a subset thereof, or an extended set thereof:

operation instructions: including various operation instructions, used for implementing various operations; and

an operating system: including various system programs, used for implementing various fundamental services and processing hardware-based tasks.

In this embodiment of this application, the processor 330 is configuredto:

obtain sound area information corresponding to each sound area in N sound areas, the sound area information including a sound area identifier, a sound pointing angle, and user information, the sound area identifier being used for identifying a sound area, the sound pointing angle being used for indicating a central angle of the sound area, the user information being used for indicating a user existence situation in the sound area, N being an integer greater than 1;

use the sound area as a target detection sound area, and generate a control signal corresponding to the target detection sound area according to sound area information corresponding to the target detection sound area, the control signal being used for performing suppression or retention on a speech input signal, the control signal and the sound area being in a one-to-one correspondence;

process a speech input signal corresponding to the target detection sound area by using the control signal corresponding to the target detection sound area, to obtain a speech output signal corresponding to the target detection sound area, the control signal, the speech input signal, and the speech output signal being in a one-to-one correspondence; and

generate a speech detection result of the target detection sound area according to the speech output signal corresponding to the target detection sound area.

The processor 330 controls an operation of the computer device 30, and the processor 330 may also be referred to as a central processing unit (CPU). The memory 340 may include a ROM and a RAM, and provides instructions and data to the processor 330. A part of the memory 340 may further include an NVRAM. During specific application, all components of the computer device 30 are coupled by using a bus system 350, and besides a data bus, the bus system 350 may further include a power source bus, a control bus, a state signal bus, and the like. However, for clear description, various types of buses in the figure are marked as the bus system 350.

The method disclosed in the foregoing embodiments of this application may be applied to the processor 330, or may be implemented by the processor 330. The processor 330 may be an integrated circuit chip having a capability of processing a signal. In an implementation process, the steps in the foregoing methods can be implemented by using a hardware integrated logical circuit in the processor 330, or by using instructions in a form of software. The foregoing processor 330 may be a general purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or a transistor logic device, or a discrete hardware component. The processor may implement or perform the methods, the steps, and the logic block diagrams that are disclosed in the embodiments of this application. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor and the like. The steps of the methods disclosed with reference to the embodiments of this application may be directly performed and completed by using a hardware decoding processor, or may be performed and completed by using a combination of hardware and software modules in the decoding processor. The software module may be stored in a storage medium that is mature in the art, such as a random access memory (RAM), a flash memory, a read-only memory (ROM), a programmable ROM, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 340, and the processor 330 reads information in the memory 340 and completes the steps in the foregoing methods in combination with hardware of the processor. For related descriptions of FIG. 13, refer to the related descriptions and effects of the method in FIG. 3. Details are not further described herein.

An embodiment of this application further provides a computer-readable storage medium, the computer-readable storage medium storing a computer program, the computer program, when run on a computer, causing the computer to perform the method described in the foregoing embodiments.

An embodiment of this application further provides a computer program product including a program, the program, when executed on a computer, causing the computer to perform the method described in the foregoing embodiments.

A person skilled in the art may clearly understand that, for convenience and conciseness of description, for specific working processes of the foregoing described system, apparatus, and unit, reference may be made to the corresponding processes in the foregoing method embodiments, and details are not described herein again.

The foregoing embodiments are merely intended for describing the technical solutions of this application, but not for limiting this application. Although this application is described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art may understand that they may still make modifications to the technical solutions described in the foregoing embodiments or make equivalent replacements to some technical features thereof, and such modifications or replacements will not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions in the embodiments of this application. In this application, the term “unit” or “module” refers to a computer program or a part of the computer program that has a predefined function and works together with other related parts to achieve a predefined goal, and may be wholly or partially implemented by using software, hardware (e.g., processing circuitry and/or memory configured to perform the predefined functions), or a combination thereof. Each unit or module can be implemented using one or more processors (or processors and memory). Likewise, a processor (or processors and memory) can be used to implement one or more modules or units. Moreover, each module or unit can be part of an overall module that includes the functionalities of the module or unit.

What is claimed is:
 1. A speech detection method performed by a computer device, the method comprising: obtaining sound area information corresponding to each sound area in N sound areas, the sound area information comprising a sound area identifier, a sound pointing angle, and user information, the sound area identifier being used for identifying a sound area, the sound pointing angle being used for indicating a central angle of the sound area, the user information being used for indicating a user existence situation in the sound area, N being an integer greater than 1; using each sound area as a target detection sound area, and generating a control signal corresponding to the target detection sound area according to the sound area information corresponding to the target detection sound area, the control signal being used for performing suppression or retention on a speech input signal corresponding to the target detection sound area; processing the speech input signal corresponding to the target detection sound area by using the control signal corresponding to the target detection sound area, to obtain a speech output signal corresponding to the target detection sound area; and generating a speech detection result of the target detection sound area according to the speech output signal corresponding to the target detection sound area.
2. The speech detection method according to claim 1, wherein the obtaining sound area information corresponding to each sound area in N sound areas comprises: detecting each sound area in the N sound areas, to obtain a user detection result corresponding to the sound area; using the sound area as the target detection sound area, and determining user information corresponding to the target detection sound area according to a user detection result corresponding to the target detection sound area; determining lip motion information corresponding to the target detection sound area according to the user detection result corresponding to the target detection sound area; obtaining a sound area identifier corresponding to the target detection sound area and a sound pointing angle corresponding to the target detection sound area; and generating the sound area information corresponding to the target detection sound area according to the user information corresponding to the target detection sound area, the lip motion information corresponding to the target detection sound area, the sound area identifier corresponding to the target detection sound area, and the sound pointing angle corresponding to the target detection sound area.
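
As a data structure, the sound area information assembled in claim 2 bundles four items: the sound area identifier, the sound pointing angle, the user information, and the lip motion information. A minimal sketch with hypothetical field names and example values (the values themselves are assumptions for illustration only):

```python
# Hypothetical record for the sound area information of claim 2.
from dataclasses import dataclass

@dataclass
class SoundAreaInfo:
    area_id: int           # sound area identifier
    pointing_angle: float  # sound pointing angle: central angle of the area
    user_info: str         # user existence situation (claim 3 identities)
    lip_motion: str        # lip motion information (claim 4 identifiers)

# Example with assumed values, e.g., one of N equal sound areas whose
# central angle is 90 degrees.
info = SoundAreaInfo(area_id=1, pointing_angle=90.0,
                     user_info="recognizable_user", lip_motion="lip_moving")
```
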
3. The speech detection method according to claim 2, wherein the determining user information corresponding to the target detection sound area according to a user detection result corresponding to the target detection sound area comprises: determining a first identity as the user information when the user detection result corresponding to the target detection sound area is that a recognizable user exists in the target detection sound area; determining a second identity as the user information when the user detection result corresponding to the target detection sound area is that no user exists in the target detection sound area; and determining a third identity as the user information when the user detection result corresponding to the target detection sound area is that an unknown user exists in the target detection sound area.
4. The speech detection method according to claim 2, wherein the determining lip motion information corresponding to the target detection sound area according to the user detection result corresponding to the target detection sound area comprises: determining a first motion identifier as the lip motion information when the user detection result corresponding to the target detection sound area is that a user with lip motion exists in the target detection sound area; determining a second motion identifier as the lip motion information when the user detection result corresponding to the target detection sound area is that a user exists in the target detection sound area and the user does not have lip motion; and determining a third motion identifier as the lip motion information when the user detection result corresponding to the target detection sound area is that no user exists in the target detection sound area.
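
Claims 3 and 4 are, in effect, two small lookup tables from the per-area user detection result to an identity and a lip motion identifier. A hedged sketch follows; the string labels for the detection results and for the first/second/third identities and motion identifiers are invented for illustration, since the claim language fixes only their roles:

```python
# Hypothetical labels for the identities of claim 3 and the motion
# identifiers of claim 4.
def user_identity(detection: str) -> str:
    return {
        "recognizable_user": "first_identity",   # recognizable user exists
        "no_user": "second_identity",            # no user exists
        "unknown_user": "third_identity",        # an unknown user exists
    }[detection]

def lip_motion_identifier(detection: str) -> str:
    return {
        "user_with_lip_motion": "first_motion",     # user with lip motion
        "user_without_lip_motion": "second_motion", # user, no lip motion
        "no_user": "third_motion",                  # no user in the area
    }[detection]
```
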
5. The speech detection method according to claim 2, wherein the generating a control signal corresponding to the target detection sound area according to sound area information corresponding to the target detection sound area comprises: generating a first control signal when the user information corresponding to the target detection sound area indicates that no user exists in the target detection sound area, the first control signal being used for performing suppression on the speech input signal; generating the first control signal when the user information corresponding to the target detection sound area indicates that a user exists in the target detection sound area and the user does not have lip motion; generating a second control signal when the user information corresponding to the target detection sound area indicates that a user exists in the target detection sound area and the user has lip motion, the second control signal being used for performing retention on the speech input signal; and generating the first control signal or the second control signal according to an original audio signal when the user information corresponding to the target detection sound area indicates that a user exists in the target detection sound area and a lip motion situation of the user is unknown.
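
The four branches of claim 5 can be summarized as a small decision function. In the sketch below, the fallback for an unknown lip motion situation is reduced to a placeholder energy test on the original audio signal; how the first or second control signal is actually chosen from that signal is not specified here, and all labels are the hypothetical ones introduced above.

```python
# Decision logic of claim 5; labels and the energy fallback are hypothetical.
from typing import List

SUPPRESS, RETAIN = 0.0, 1.0  # first and second control signals

def control_signal(user_info: str, lip_motion: str,
                   original_audio: List[float]) -> float:
    if user_info == "no_user":
        return SUPPRESS                   # first control signal: suppression
    if lip_motion == "second_motion":     # user present, no lip motion
        return SUPPRESS
    if lip_motion == "first_motion":      # user present, lip motion
        return RETAIN                     # second control signal: retention
    # Lip motion situation unknown: fall back to the original audio signal;
    # a crude energy threshold stands in for whatever test is actually used.
    energy = sum(s * s for s in original_audio) / max(len(original_audio), 1)
    return RETAIN if energy > 1e-3 else SUPPRESS
```
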
6. The speech detection method according to claim 1, wherein the generating a control signal corresponding to the target detection sound area according to sound area information corresponding to the target detection sound area comprises: generating a first control signal when the user information corresponding to the target detection sound area indicates that no user exists in the target detection sound area, the first control signal being used for performing suppression on the speech input signal; and generating a second control signal when the user information corresponding to the target detection sound area indicates that a user exists in the target detection sound area, the second control signal being used for performing retention on the speech input signal.
7. The speech detection method according to claim 1, wherein the generating a control signal corresponding to the target detection sound area according to sound area information corresponding to the target detection sound area comprises: generating the control signal corresponding to the target detection sound area according to sound area information corresponding to the target detection sound area by using a preset algorithm, the preset algorithm being one of an adaptive beamforming algorithm, a blind source separation algorithm, or a deep learning-based speech separation algorithm.
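
Claim 7 leaves the separation stage open among three algorithm families. As one concrete instance of the adaptive beamforming family only, a minimum variance distortionless response (MVDR) beamformer steered toward a sound pointing angle can be sketched as follows; the uniform linear array geometry, the single frequency bin, and all parameter values are assumptions for illustration, not the configuration of this application.

```python
# MVDR beamforming sketch (one family named in claim 7); array geometry
# and parameters are illustrative assumptions.
import numpy as np

def mvdr_weights(noise_cov: np.ndarray, steering: np.ndarray) -> np.ndarray:
    """w = R^{-1} d / (d^H R^{-1} d): minimize output noise power while
    passing the steering direction without distortion."""
    rinv_d = np.linalg.solve(noise_cov, steering)
    return rinv_d / (steering.conj() @ rinv_d)

def steering_vector(angle_deg: float, n_mics: int = 4, spacing_m: float = 0.05,
                    freq_hz: float = 1000.0, c: float = 343.0) -> np.ndarray:
    """Far-field steering vector of a uniform linear array toward the
    sound pointing angle (measured from the array axis)."""
    delays = np.arange(n_mics) * spacing_m * np.cos(np.deg2rad(angle_deg)) / c
    return np.exp(-2j * np.pi * freq_hz * delays)

# Example: steer one frequency bin toward a 90-degree pointing angle.
d = steering_vector(90.0)
R = np.eye(4, dtype=complex)   # stand-in noise covariance estimate
w = mvdr_weights(R, d)
x = np.ones(4, dtype=complex)  # one frame of 4 microphone signals
y = w.conj() @ x               # beamformer output y = w^H x
```
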
8. The speech detection method according to claim 1, wherein the generating a speech detection result of the target detection sound area according to the speech output signal corresponding to the target detection sound area comprises: determining a signal power corresponding to the target detection sound area according to the speech output signal corresponding to the target detection sound area, wherein the signal power is a signal power of the speech output signal at a time-frequency point; determining an estimated signal-to-noise ratio corresponding to the target detection sound area according to the signal power corresponding to the target detection sound area; determining an output signal weighted value corresponding to the target detection sound area according to the estimated signal-to-noise ratio corresponding to the target detection sound area, the output signal weighted value being a weighted result of the speech output signal at the time-frequency point; determining a target speech output signal corresponding to the target detection sound area according to the output signal weighted value corresponding to the target detection sound area and the speech output signal corresponding to the target detection sound area; and determining the speech detection result corresponding to the target detection sound area according to the target speech output signal corresponding to the target detection sound area.
9. The speech detection method according to claim 8, wherein the generating a speech detection result of the target detection sound area according to the speech output signal corresponding to the target detection sound area comprises: generating a first speech detection result when the target speech output signal corresponding to the target detection sound area meets a human voice matching condition, the first speech detection result indicating that the target speech output signal is a human voice signal; and generating a second speech detection result when the target speech output signal corresponding to the target detection sound area does not meet the human voice matching condition, the second speech detection result indicating that the target speech output signal is a noise signal.
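
Claims 8 and 9 describe a per-time-frequency chain: signal power, then an estimated signal-to-noise ratio, then a weight applied to the speech output signal, and finally a human-voice/noise verdict on the weighted (target) signal. The concrete estimators in the sketch below (a per-frequency minimum power as the noise floor, the Wiener-style weight snr/(1+snr), and an energy-ratio voice test) are assumptions filled in for illustration; the claims do not fix them.

```python
# Hypothetical instantiation of the claim 8/9 chain over an STFT-like
# time-frequency matrix; the specific estimators are illustrative only.
import numpy as np

def detect(speech_out_tf: np.ndarray, voice_threshold: float = 2.0) -> str:
    # Signal power of the speech output signal at each time-frequency point.
    power = np.abs(speech_out_tf) ** 2

    # Estimated SNR per point, using the minimum power over time in each
    # frequency bin as a crude noise estimate.
    noise_floor = power.min(axis=1, keepdims=True) + 1e-12
    snr = power / noise_floor

    # Output signal weighted value: a Wiener-style gain in [0, 1).
    weight = snr / (1.0 + snr)

    # Target speech output signal: the weighted time-frequency signal.
    target = weight * speech_out_tf

    # Human voice matching condition, reduced here to an energy-ratio test.
    ratio = np.sum(np.abs(target) ** 2) / (np.sum(noise_floor) * power.shape[1])
    return "human_voice" if ratio > voice_threshold else "noise"

# Example: 4 frequency bins x 10 frames of random complex spectra.
rng = np.random.default_rng(0)
tf = rng.standard_normal((4, 10)) + 1j * rng.standard_normal((4, 10))
print(detect(tf))
```
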
10. The speech detection method according to claim 1, wherein after the generating a speech detection result of the target detection sound area according to the speech output signal corresponding to the target detection sound area, the method further comprises: determining a target sound area from the N sound areas when the speech detection results corresponding to the N sound areas indicate that the speech output signal is a human voice signal; and transmitting the speech output signal corresponding to the target sound area to a calling party.
11. The speech detection method according to claim 1, wherein after the generating a speech detection result of the target detection sound area according to the speech output signal corresponding to the target detection sound area, the method further comprises: determining a target sound area from the N sound areas when the speech detection results corresponding to the N sound areas indicate that the speech output signal is a human voice signal; performing semantic recognition on the speech output signal corresponding to the target sound area, to obtain a semantic recognition result; and generating dialog response information according to the semantic recognition result.
12. The speech detection method according to claim 1, wherein after the generating a speech detection result of the target detection sound area according to the speech output signal corresponding to the target detection sound area, the method further comprises: determining a target sound area from the N sound areas when the speech detection results corresponding to the N sound areas indicate that the speech output signal is a human voice signal; performing segmentation processing on the speech output signal corresponding to the target sound area, to obtain audio data corresponding to the target sound area; performing speech recognition on the audio data corresponding to the target sound area, to obtain a speech recognition result corresponding to the target sound area; and generating text record information according to the speech recognition result corresponding to the target sound area, the text record information comprising at least one of translation text or conference record text.
13. A computer device, comprising: a memory, a transceiver, a processor, and a bus system, the memory being configured to store a program, the bus system being configured to connect the memory and the processor, to cause the memory and the processor to perform communication, and the processor being configured to execute the program in the memory, and perform a speech detection method comprising: obtaining sound area information corresponding to each sound area in N sound areas, the sound area information comprising a sound area identifier, a sound pointing angle, and user information, the sound area identifier being used for identifying a sound area, the sound pointing angle being used for indicating a central angle of the sound area, the user information being used for indicating a user existence situation in the sound area, N being an integer greater than 1; using each sound area as a target detection sound area, and generating a control signal corresponding to the target detection sound area according to the sound area information corresponding to the target detection sound area, the control signal being used for performing suppression or retention on a speech input signal corresponding to the target detection sound area; processing the speech input signal corresponding to the target detection sound area by using the control signal corresponding to the target detection sound area, to obtain a speech output signal corresponding to the target detection sound area; and generating a speech detection result of the target detection sound area according to the speech output signal corresponding to the target detection sound area.
14. The computer device according to claim 13, wherein the obtaining sound area information corresponding to each sound area in N sound areas comprises: detecting each sound area in the N sound areas, to obtain a user detection result corresponding to the sound area; using the sound area as the target detection sound area, and determining user information corresponding to the target detection sound area according to a user detection result corresponding to the target detection sound area; determining lip motion information corresponding to the target detection sound area according to the user detection result corresponding to the target detection sound area; obtaining a sound area identifier corresponding to the target detection sound area and a sound pointing angle corresponding to the target detection sound area; and generating the sound area information corresponding to the target detection sound area according to the user information corresponding to the target detection sound area, the lip motion information corresponding to the target detection sound area, the sound area identifier corresponding to the target detection sound area, and the sound pointing angle corresponding to the target detection sound area.
15. The computer device according to claim 13, wherein the generating a control signal corresponding to the target detection sound area according to sound area information corresponding to the target detection sound area comprises: generating a first control signal when the user information corresponding to the target detection sound area indicates that no user exists in the target detection sound area, the first control signal being used for performing suppression on the speech input signal; and generating a second control signal when the user information corresponding to the target detection sound area indicates that a user exists in the target detection sound area, the second control signal being used for performing retention on the speech input signal.
16. The computer device according to claim 13, wherein the generating a control signal corresponding to the target detection sound area according to sound area information corresponding to the target detection sound area comprises: generating the control signal corresponding to the target detection sound area according to sound area information corresponding to the target detection sound area by using a preset algorithm, the preset algorithm being one of an adaptive beamforming algorithm, a blind source separation algorithm, or a deep learning-based speech separation algorithm.
17. The computer device according to claim 13, wherein the generating a speech detection result of the target detection sound area according to the speech output signal corresponding to the target detection sound area comprises: determining a signal power corresponding to the target detection sound area according to the speech output signal corresponding to the target detection sound area, wherein the signal power is a signal power of the speech output signal at a time-frequency point; determining an estimated signal-to-noise ratio corresponding to the target detection sound area according to the signal power corresponding to the target detection sound area; determining an output signal weighted value corresponding to the target detection sound area according to the estimated signal-to-noise ratio corresponding to the target detection sound area, the output signal weighted value being a weighted result of the speech output signal at the time-frequency point; determining a target speech output signal corresponding to the target detection sound area according to the output signal weighted value corresponding to the target detection sound area and the speech output signal corresponding to the target detection sound area; and determining the speech detection result corresponding to the target detection sound area according to the target speech output signal corresponding to the target detection sound area.
18. A non-transitory computer-readable storage medium comprising instructions that, when executed by a processor of a computer device, cause the computer device to perform a speech detection method including: obtaining sound area information corresponding to each sound area in N sound areas, the sound area information comprising a sound area identifier, a sound pointing angle, and user information, the sound area identifier being used for identifying a sound area, the sound pointing angle being used for indicating a central angle of the sound area, the user information being used for indicating a user existence situation in the sound area, N being an integer greater than 1; using each sound area as a target detection sound area, and generating a control signal corresponding to the target detection sound area according to the sound area information corresponding to the target detection sound area, the control signal being used for performing suppression or retention on a speech input signal corresponding to the target detection sound area; processing the speech input signal corresponding to the target detection sound area by using the control signal corresponding to the target detection sound area, to obtain a speech output signal corresponding to the target detection sound area; and generating a speech detection result of the target detection sound area according to the speech output signal corresponding to the target detection sound area.
19. The non-transitory computer-readable storage medium according to claim 18, wherein the obtaining sound area information corresponding to each sound area in N sound areas comprises: detecting each sound area in the N sound areas, to obtain a user detection result corresponding to the sound area; using the sound area as the target detection sound area, and determining user information corresponding to the target detection sound area according to a user detection result corresponding to the target detection sound area; determining lip motion information corresponding to the target detection sound area according to the user detection result corresponding to the target detection sound area; obtaining a sound area identifier corresponding to the target detection sound area and a sound pointing angle corresponding to the target detection sound area; and generating the sound area information corresponding to the target detection sound area according to the user information corresponding to the target detection sound area, the lip motion information corresponding to the target detection sound area, the sound area identifier corresponding to the target detection sound area, and the sound pointing angle corresponding to the target detection sound area.
20. The non-transitory computer-readable storage medium according to claim 18, wherein the generating a control signal corresponding to the target detection sound area according to sound area information corresponding to the target detection sound area comprises: generating a first control signal when the user information corresponding to the target detection sound area indicates that no user exists in the target detection sound area, the first control signal being used for performing suppression on the speech input signal; and generating a second control signal when the user information corresponding to the target detection sound area indicates that a user exists in the target detection sound area, the second control signal being used for performing retention on the speech input signal.